Package com.kohlschutter.boilerpipe.filters.heuristics
These BoilerpipeFilters are pure heuristics.
Class Summary

Class
    Description

AddPrecedingLabelsFilter
    Adds the labels of the preceding block to the current block, optionally adding a prefix.

ArticleMetadataFilter
    Tries to find TextBlocks that consist of "article metadata".

BlockProximityFusion
    Fuses adjacent blocks if their distance (in blocks) does not exceed a certain limit.

ContentFusion
    Merges two blocks using some heuristics.

DocumentTitleMatchClassifier
    Marks TextBlocks which contain parts of the HTML <TITLE> tag, using some heuristics which are quite specific to the news domain.

ExpandTitleToContentFilter
    Marks as "content" all TextBlocks that lie between the headline and the part that has already been marked content, provided they are marked DefaultLabels.MIGHT_BE_CONTENT.

KeepLargestBlockFilter
    Keeps only the largest TextBlock (by the number of words).

LabelFusion
    Fuses adjacent blocks if their labels are equal.

LargeBlockSameTagLevelToContentFilter
    Marks as content all blocks that:
      - are on the same tag level as very likely main content (usually the level of the largest block)
      - have a significant number of words (currently at least 100)

ListAtEndFilter
    Marks nested list-item blocks after the end of the main content.

SimpleBlockFusionProcessor
    Merges two subsequent blocks if their text densities are equal.

TrailingHeadlineToBoilerplateFilter
    Marks trailing headlines (TextBlocks that have the label DefaultLabels.HEADING) as boilerplate.
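
To make two of the heuristics above more concrete, the following is a minimal, self-contained sketch of the ideas behind LabelFusion (fuse adjacent blocks with equal labels) and KeepLargestBlockFilter (keep only the block with the most words). It does not use the boilerpipe API: the class HeuristicsSketch, its nested Block type, and the methods fuseEqualLabels and keepLargestBlock are hypothetical stand-ins invented for illustration, and the library's own filters may differ in detail (for example in how fused blocks are combined or how ties are broken).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; these are NOT the boilerpipe classes.
final class HeuristicsSketch {

    /** Simplified stand-in for a text block: text, content flag, labels. */
    static final class Block {
        String text;
        boolean isContent;
        final Set<String> labels = new HashSet<>();

        Block(String text, boolean isContent, String... labels) {
            this.text = text;
            this.isContent = isContent;
            this.labels.addAll(Arrays.asList(labels));
        }

        /** Counts whitespace-separated words in the block's text. */
        int numWords() {
            String t = text.trim();
            return t.isEmpty() ? 0 : t.split("\\s+").length;
        }
    }

    /** Fuses each block into its predecessor when both carry the same label set. */
    static boolean fuseEqualLabels(List<Block> blocks) {
        boolean changed = false;
        List<Block> fused = new ArrayList<>();
        for (Block b : blocks) {
            Block prev = fused.isEmpty() ? null : fused.get(fused.size() - 1);
            if (prev != null && prev.labels.equals(b.labels)) {
                // Assumption: the fused block is content if either part was.
                prev.text = prev.text + "\n" + b.text;
                prev.isContent = prev.isContent || b.isContent;
                changed = true;
            } else {
                fused.add(b);
            }
        }
        blocks.clear();
        blocks.addAll(fused);
        return changed;
    }

    /** Leaves only the largest content block (by word count) marked as content. */
    static boolean keepLargestBlock(List<Block> blocks) {
        Block largest = null;
        for (Block b : blocks) {
            if (b.isContent && (largest == null || b.numWords() > largest.numWords())) {
                largest = b;
            }
        }
        boolean changed = false;
        for (Block b : blocks) {
            if (b.isContent && b != largest) {
                b.isContent = false;
                changed = true;
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        List<Block> blocks = new ArrayList<>(Arrays.asList(
                new Block("Home News Sports", true, "NAV"),
                new Block("Login Register", true, "NAV"),
                new Block("The quick brown fox jumps over the lazy dog near the river bank", true),
                new Block("Copyright notice", false, "FOOTER")));

        fuseEqualLabels(blocks);   // the two NAV blocks become one
        keepLargestBlock(blocks);  // only the longest content block stays content

        for (Block b : blocks) {
            System.out.println(b.isContent + " | " + b.text.replace('\n', ' '));
        }
    }
}
```

Both methods return a boolean that reports whether anything changed, which is a convenient convention when filters are chained. That return value is a design choice of this sketch, not a statement about the library's interface.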