Class KeepLargestFulltextBlockFilter
java.lang.Object
com.kohlschutter.boilerpipe.filters.english.HeuristicFilterBase
com.kohlschutter.boilerpipe.filters.english.KeepLargestFulltextBlockFilter
All Implemented Interfaces:
BoilerpipeFilter
public final class KeepLargestFulltextBlockFilter
extends HeuristicFilterBase
implements BoilerpipeFilter
Keeps only the largest TextBlock (by number of words). If more than one block
has the same number of words, the first block is chosen. All discarded blocks are marked
"not content" and flagged as DefaultLabels.MIGHT_BE_CONTENT.
As opposed to KeepLargestBlockFilter, the number of words is computed using
HeuristicFilterBase.getNumFullTextWords(TextBlock), which only counts words that occur in
text elements with at least 9 words and are thus believed to be full text.
NOTE: Without language-specific fine-tuning (i.e., when running the default instance), this filter may
lead to suboptimal results. Consider using KeepLargestBlockFilter instead, which works at
the level of number-of-words instead of text densities.
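The selection rule described above can be illustrated with a minimal, self-contained sketch. It does not use the boilerpipe API: the Block class, its fields, and the keepLargest method are hypothetical stand-ins, the full-text word count is assumed to be computed elsewhere, and the actual filter may differ in detail (for example in how it treats blocks that are already non-content).

```java
import java.util.ArrayList;
import java.util.List;

public class KeepLargestSketch {

    /** Hypothetical stand-in for a text block; this is not boilerpipe's TextBlock. */
    static final class Block {
        final String text;
        final int fullTextWords;        // assumed to be computed elsewhere
        boolean isContent = true;
        boolean mightBeContent = false; // stands in for DefaultLabels.MIGHT_BE_CONTENT

        Block(String text, int fullTextWords) {
            this.text = text;
            this.fullTextWords = fullTextWords;
        }
    }

    /**
     * Keeps the block with the most full-text words and demotes all others.
     * Returns true if the state of any block was changed.
     */
    static boolean keepLargest(List<Block> blocks) {
        Block largest = null;
        for (Block b : blocks) {
            // Strict '>' keeps the earliest block when counts tie.
            if (largest == null || b.fullTextWords > largest.fullTextWords) {
                largest = b;
            }
        }
        boolean changed = false;
        for (Block b : blocks) {
            if (b != largest) {
                if (b.isContent || !b.mightBeContent) {
                    changed = true;
                }
                b.isContent = false;     // marked "not content"
                b.mightBeContent = true; // flagged for possible later recovery
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        List<Block> blocks = new ArrayList<>();
        blocks.add(new Block("Teaser", 3));
        blocks.add(new Block("First article", 40));
        blocks.add(new Block("Second article", 40));

        boolean changed = keepLargest(blocks);
        for (Block b : blocks) {
            System.out.println(b.text + ": content=" + b.isContent
                    + ", mightBeContent=" + b.mightBeContent);
        }
        System.out.println("changed=" + changed);
    }
}
```

In this sketch the two 40-word blocks tie, so the first of them remains content while the other two blocks are demoted and flagged.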
Field Summary
Fields
Constructor Summary
Constructors
Method Summary
Modifier and Type    Method                       Description
boolean              process(TextDocument doc)    Processes the given document doc.

Methods inherited from class com.kohlschutter.boilerpipe.filters.english.HeuristicFilterBase
getNumFullTextWords, getNumFullTextWords
Field Details
INSTANCE
Constructor Details
KeepLargestFulltextBlockFilter
public KeepLargestFulltextBlockFilter()
Method Details
process
Description copied from interface: BoilerpipeFilter
Processes the given document doc.
Specified by:
process in interface BoilerpipeFilter
Parameters:
doc - The TextDocument that is to be processed.
Returns:
true if changes have been made to the TextDocument.
Throws:
BoilerpipeProcessingException