Class DocumentTitleMatchClassifier

  • All Implemented Interfaces:
    BoilerpipeFilter

    public final class DocumentTitleMatchClassifier
    extends java.lang.Object
    implements BoilerpipeFilter
    Marks TextBlocks that contain parts of the text of the HTML <TITLE> element, using heuristics that are specific to the news domain.
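
The matching idea in the description can be illustrated with a minimal, self-contained sketch. It does not use the Boilerpipe API: `TitleMatchSketch`, `normalize`, and `matchesTitle` are hypothetical names, and the normalization rule (strip some punctuation, trim, lowercase) is an assumption made for illustration, not the confirmed behavior of this class or of its PAT_REMOVE_CHARACTERS field.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

public class TitleMatchSketch {
    // Assumed set of punctuation to strip before comparing (hypothetical pattern).
    private static final Pattern REMOVE_CHARACTERS = Pattern.compile("[?!.\\-:]+");

    /** Normalizes text so that punctuation and case differences do not prevent a match. */
    static String normalize(String text) {
        return REMOVE_CHARACTERS.matcher(text).replaceAll("").trim().toLowerCase(Locale.ROOT);
    }

    /** Returns true if the normalized block text equals one of the normalized candidate titles. */
    static boolean matchesTitle(String blockText, Set<String> normalizedCandidates) {
        return normalizedCandidates.contains(normalize(blockText));
    }

    public static void main(String[] args) {
        Set<String> candidates = Set.of(normalize("Storm hits coast"));
        System.out.println(matchesTitle("Storm hits coast!", candidates));
        System.out.println(matchesTitle("Related stories", candidates));
    }
}
```

In this sketch a block would be marked as a title only on an exact match after normalization; whether the real classifier also accepts partial matches is not stated on this page.
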
    • Method Summary

      Modifier and Type Method Description
      private void addPotentialTitles(java.util.Set<java.lang.String> potentialTitles, java.lang.String title, java.lang.String pattern, int minWords)
      private java.lang.String getLongestPart(java.lang.String title, java.lang.String pattern)
      java.util.Set<java.lang.String> getPotentialTitles()
      boolean process(TextDocument doc)
      Processes the given document doc.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • potentialTitles

        private final java.util.Set<java.lang.String> potentialTitles
      • PAT_REMOVE_CHARACTERS

        private static final java.util.regex.Pattern PAT_REMOVE_CHARACTERS
    • Constructor Detail

      • DocumentTitleMatchClassifier

        public DocumentTitleMatchClassifier(java.lang.String title)
    • Method Detail

      • getPotentialTitles

        public java.util.Set<java.lang.String> getPotentialTitles()
      • addPotentialTitles

        private void addPotentialTitles(java.util.Set<java.lang.String> potentialTitles,
                                        java.lang.String title,
                                        java.lang.String pattern,
                                        int minWords)
      • getLongestPart

        private java.lang.String getLongestPart(java.lang.String title,
                                                java.lang.String pattern)
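
This page gives no descriptions for the two private helpers, so the following is only a sketch of what their signatures suggest: the title is split on a separator regex, `addPotentialTitles` keeps parts with at least `minWords` words, and `getLongestPart` returns the part with the most words. The class name `TitleCandidatesSketch`, the helper `countWords`, the separator pattern, and the sample title are all assumptions for illustration; the actual implementation may differ.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class TitleCandidatesSketch {
    /** Counts whitespace-separated words in a title part (hypothetical helper). */
    static int countWords(String part) {
        String trimmed = part.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Adds every part of the split title that has at least minWords words. */
    static void addPotentialTitles(Set<String> potentialTitles, String title,
                                   String pattern, int minWords) {
        String[] parts = title.split(pattern);
        if (parts.length == 1) {
            return; // separator does not occur, so there is nothing to add
        }
        for (String part : parts) {
            if (countWords(part) >= minWords) {
                potentialTitles.add(part.trim());
            }
        }
    }

    /** Returns the part with the most words, or null if the separator does not occur. */
    static String getLongestPart(String title, String pattern) {
        String[] parts = title.split(pattern);
        if (parts.length == 1) {
            return null;
        }
        String longest = null;
        int longestWords = 0;
        for (String part : parts) {
            int words = countWords(part);
            if (words > longestWords) {
                longestWords = words;
                longest = part.trim();
            }
        }
        return longest;
    }

    public static void main(String[] args) {
        // Sample news-style title: headline, separator, site name (made up for this sketch).
        String title = "Storm hits coast as thousands evacuate | Example News";
        String separator = "[ ]*[|][ ]*";
        Set<String> candidates = new LinkedHashSet<>();
        addPotentialTitles(candidates, title, separator, 4);
        System.out.println(candidates);
        System.out.println(getLongestPart(title, separator));
    }
}
```

The word-count threshold is what lets a short site name such as "Example News" be dropped while the headline part is kept as a candidate.
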