Class KeepLargestFulltextBlockFilter

java.lang.Object
com.kohlschutter.boilerpipe.filters.english.HeuristicFilterBase
com.kohlschutter.boilerpipe.filters.english.KeepLargestFulltextBlockFilter
All Implemented Interfaces:
BoilerpipeFilter

public final class KeepLargestFulltextBlockFilter extends HeuristicFilterBase implements BoilerpipeFilter
Keeps only the largest TextBlock, measured by number of words. If several blocks have the same number of words, the first one is chosen. All discarded blocks are marked "not content" and flagged as DefaultLabels.MIGHT_BE_CONTENT.

Unlike KeepLargestBlockFilter, this filter computes the number of words using HeuristicFilterBase.getNumFullTextWords(TextBlock), which counts only words that occur in text elements with at least 9 words and are thus believed to be full text.

NOTE: Without language-specific fine-tuning (i.e., when running the default instance), this filter may lead to suboptimal results. Consider using KeepLargestBlockFilter instead, which works at the level of number-of-words instead of text densities.
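
The selection rule described here (largest count wins, ties go to the first block, all other blocks are demoted and labelled) can be illustrated with a minimal, self-contained sketch. It does not use the boilerpipe API: Block, KeepLargestSketch, keepLargest and the string label are hypothetical stand-ins, and the full-text word count is assumed to be supplied by the caller rather than computed by HeuristicFilterBase.getNumFullTextWords(TextBlock).

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative sketch only; none of these types belong to boilerpipe. */
final class KeepLargestSketch {

    /** Placeholder standing in for the DefaultLabels.MIGHT_BE_CONTENT label. */
    static final String MIGHT_BE_CONTENT = "MIGHT_BE_CONTENT";

    /** Hypothetical stand-in for a TextBlock. */
    static final class Block {
        final String text;
        final int fullTextWords; // assumed to be computed elsewhere
        boolean content = true;
        final Set<String> labels = new LinkedHashSet<>();

        Block(String text, int fullTextWords) {
            this.text = text;
            this.fullTextWords = fullTextWords;
        }
    }

    /**
     * Keeps the first block with the highest full-text word count and
     * demotes every other block. Returns the kept block, or null if
     * the list is empty.
     */
    static Block keepLargest(List<Block> blocks) {
        Block largest = null;
        for (Block b : blocks) {
            // Strict '>' means a later block with an equal count
            // never replaces an earlier one.
            if (largest == null || b.fullTextWords > largest.fullTextWords) {
                largest = b;
            }
        }
        for (Block b : blocks) {
            if (b == largest) {
                b.content = true;
            } else {
                b.content = false;
                b.labels.add(MIGHT_BE_CONTENT);
            }
        }
        return largest;
    }
}
```

In this sketch, passing blocks with counts 0, 40, 40 and 3 keeps the first 40-word block; the other three are marked as not content and receive the placeholder label. The filter itself may handle edge cases (such as blocks already marked as non-content) differently.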