Package com.kohlschutter.boilerpipe.filters.heuristics
These BoilerpipeFilters are pure heuristics.
Classes
- Adds the labels of the preceding block to the current block, optionally adding a prefix.
- Tries to find TextBlocks that consist of "article metadata".
- Fuses adjacent blocks if their distance (in blocks) does not exceed a certain limit.
- Merges two blocks using some heuristics.
- Marks TextBlocks which contain parts of the HTML <TITLE> tag, using some heuristics which are quite specific to the news domain.
- Marks all TextBlocks as "content" which are between the headline and the part that has already been marked content, if they are marked DefaultLabels.MIGHT_BE_CONTENT.
- Keeps the largest TextBlock only (by the number of words).
- Fuses adjacent blocks if their labels are equal.
- Marks all blocks as content that:
    - are on the same tag-level as very likely main content (usually the level of the largest block)
    - have a significant number of words, currently: at least 100
- Marks nested list-item blocks after the end of the main content.
- Merges two subsequent blocks if their text densities are equal.
- Marks trailing headlines (TextBlocks that have the label DefaultLabels.HEADING) as boilerplate.
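Two of the heuristics described above, keeping only the largest block and marking trailing headlines as boilerplate, can be illustrated with a minimal, self-contained sketch. It does not use the boilerpipe API: the Block class and both method names are hypothetical stand-ins chosen for this example, and the real filters may differ in their details.

```java
import java.util.Arrays;
import java.util.List;

class HeuristicsSketch {

    /** Hypothetical stand-in for a text block; NOT boilerpipe's TextBlock. */
    static final class Block {
        final String text;
        final int numWords;
        final boolean isHeading; // plays the role of a "heading" label
        boolean isContent;

        Block(String text, boolean isContent, boolean isHeading) {
            this.text = text;
            String trimmed = text.trim();
            this.numWords = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
            this.isContent = isContent;
            this.isHeading = isHeading;
        }
    }

    /** Walks backwards and demotes content headings until a non-heading content block is met. */
    static void markTrailingHeadlines(List<Block> blocks) {
        for (int i = blocks.size() - 1; i >= 0; i--) {
            Block b = blocks.get(i);
            if (!b.isContent) {
                continue; // blocks already treated as boilerplate are skipped
            }
            if (b.isHeading) {
                b.isContent = false;
            } else {
                break; // reached real content; earlier headings stay untouched
            }
        }
    }

    /** Keeps only the content block with the most words; all other blocks become non-content. */
    static void keepLargestBlock(List<Block> blocks) {
        Block largest = null;
        for (Block b : blocks) {
            if (b.isContent && (largest == null || b.numWords > largest.numWords)) {
                largest = b;
            }
        }
        for (Block b : blocks) {
            b.isContent = (b == largest);
        }
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
                new Block("Example headline", true, true),
                new Block("This is the main body of the example article", true, false),
                new Block("Related stories", true, true));
        markTrailingHeadlines(blocks);
        keepLargestBlock(blocks);
        for (Block b : blocks) {
            System.out.println(b.isContent + "\t" + b.text);
        }
    }
}
```

Applying the two steps in sequence mirrors how such filters are typically chained: each one only flips the content flag, so later steps see the decisions of earlier ones.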