Class DocumentTitleMatchClassifier
- java.lang.Object
  - com.kohlschutter.boilerpipe.filters.heuristics.DocumentTitleMatchClassifier
- All Implemented Interfaces: BoilerpipeFilter
public final class DocumentTitleMatchClassifier extends java.lang.Object implements BoilerpipeFilter
Marks TextBlocks which contain parts of the HTML <TITLE> tag, using some heuristics which are quite specific to the news domain.
-
-
Field Summary
Fields (Modifier and Type, Field):
- private static java.util.regex.Pattern PAT_REMOVE_CHARACTERS
- private java.util.Set<java.lang.String> potentialTitles
-
Constructor Summary
Constructors:
- DocumentTitleMatchClassifier(java.lang.String title)
-
Method Summary
Methods (Modifier and Type, Method, Description):
- private void addPotentialTitles(java.util.Set<java.lang.String> potentialTitles, java.lang.String title, java.lang.String pattern, int minWords)
- private java.lang.String getLongestPart(java.lang.String title, java.lang.String pattern)
- java.util.Set<java.lang.String> getPotentialTitles()
- boolean process(TextDocument doc): Processes the given document doc.
-
-
-
Method Detail
-
getPotentialTitles
public java.util.Set<java.lang.String> getPotentialTitles()
-
addPotentialTitles
private void addPotentialTitles(java.util.Set<java.lang.String> potentialTitles, java.lang.String title, java.lang.String pattern, int minWords)
-
getLongestPart
private java.lang.String getLongestPart(java.lang.String title, java.lang.String pattern)
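This page gives only the signature of getLongestPart, not its behavior. Judging by the parameter names, it plausibly splits the title on a separator pattern and keeps the longest piece, which is a common way to strip a site name from a news headline. The following is a minimal sketch under that assumption. TitlePartSketch and longestPart are hypothetical names, and the separator regex is illustrative; none of this is taken from the library's source.

```java
// Hypothetical stand-alone helper; NOT the library's private getLongestPart.
final class TitlePartSketch {

    /**
     * Splits the title on the given regex and returns the trimmed part with
     * the most characters, or null if the pattern does not split the title.
     * (Assumed behavior; the documented method may differ.)
     */
    static String longestPart(String title, String pattern) {
        String[] parts = title.split(pattern);
        if (parts.length <= 1) {
            return null; // separator absent: nothing to choose between
        }
        String longest = "";
        for (String part : parts) {
            String trimmed = part.trim();
            if (trimmed.length() > longest.length()) {
                longest = trimmed;
            }
        }
        return longest.isEmpty() ? null : longest;
    }

    public static void main(String[] args) {
        String separator = "[ ]*[|][ ]*"; // illustrative separator pattern
        // prints "Storm closes coastal roads"
        System.out.println(longestPart("Storm closes coastal roads | Example News", separator));
        // prints "null"
        System.out.println(longestPart("No separator here", separator));
    }
}
```

Returning null when the separator is absent lets a caller skip that pattern and try the next one without adding a duplicate of the full title.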
-
process
public boolean process(TextDocument doc) throws BoilerpipeProcessingException
Description copied from interface: BoilerpipeFilter
Processes the given document doc.
- Specified by: process in interface BoilerpipeFilter
- Parameters: doc - The TextDocument that is to be processed.
- Returns: true if changes have been made to the TextDocument.
- Throws: BoilerpipeProcessingException
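The return contract of process (true only if the document was changed) can be illustrated without the library. The sketch below uses hypothetical stand-ins: TitleMatchSketch and Block are not boilerpipe types, and the normalization step is an assumption suggested by the PAT_REMOVE_CHARACTERS field name rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-ins: TitleMatchSketch and Block are NOT boilerpipe types.
final class TitleMatchSketch {

    /** Minimal stand-in for a text block that can carry a "title" mark. */
    static final class Block {
        final String text;
        boolean markedAsTitle;

        Block(String text) {
            this.text = text;
        }
    }

    private final Set<String> potentialTitles = new HashSet<>();

    TitleMatchSketch(String title) {
        if (title != null && !title.trim().isEmpty()) {
            potentialTitles.add(normalize(title));
        }
    }

    /** Assumed normalization: lower-case, drop punctuation, collapse spaces. */
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{N} ]", "")
                .replaceAll(" +", " ")
                .trim();
    }

    /** Marks matching blocks; returns true only if a mark was newly set. */
    boolean process(List<Block> blocks) {
        boolean changes = false;
        for (Block block : blocks) {
            if (!block.markedAsTitle && potentialTitles.contains(normalize(block.text))) {
                block.markedAsTitle = true;
                changes = true;
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        List<Block> blocks = new ArrayList<>();
        blocks.add(new Block("Storm closes coastal roads!"));
        blocks.add(new Block("Residents were told to stay indoors."));

        TitleMatchSketch filter = new TitleMatchSketch("Storm closes coastal roads");
        System.out.println(filter.process(blocks)); // prints "true": first block newly marked
        System.out.println(filter.process(blocks)); // prints "false": nothing left to change
    }
}
```

Reporting false on the second pass is what lets a caller chain filters and detect when a pipeline has stopped modifying the document.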