Uses of Interface
com.kohlschutter.boilerpipe.BoilerpipeFilter
Packages that use BoilerpipeFilter
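Package: Description
com.kohlschutter.boilerpipe: The Boilerpipe top-level package.
com.kohlschutter.boilerpipe.extractors: Some standard extractors (i.e., completely piped BoilerpipeFilters).
com.kohlschutter.boilerpipe.filters.debug
com.kohlschutter.boilerpipe.filters.english: These BoilerpipeFilters have only been tested on English text.
com.kohlschutter.boilerpipe.filters.heuristics: These BoilerpipeFilters are pure heuristics.
com.kohlschutter.boilerpipe.filters.simple: These BoilerpipeFilters are straightforward and probably not really specific to English.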
Uses of BoilerpipeFilter in com.kohlschutter.boilerpipe
Subinterfaces of BoilerpipeFilter in com.kohlschutter.boilerpipe
Modifier and Type: Description
interface: Describes a complete filter pipeline.
Uses of BoilerpipeFilter in com.kohlschutter.boilerpipe.extractors
Classes in com.kohlschutter.boilerpipe.extractors that implement BoilerpipeFilter
Modifier and Type: Description
final class: A full-text extractor which is tuned towards news articles.
final class: A full-text extractor which is tuned towards extracting sentences from news articles.
class
class: A quite generic full-text extractor.
class: The base class of Extractors.
final class: Marks everything as content.
final class: A full-text extractor which extracts the largest text component of a page.
final class: A full-text extractor which extracts the largest text component of a page.
class: A quite generic full-text extractor solely based upon the number of words per block (the current, the previous and the next block).

Fields in com.kohlschutter.boilerpipe.extractors declared as BoilerpipeFilter
Modifier and Type, Field: Description
static final BoilerpipeFilter CanolaExtractor.CLASSIFIER: The actual classifier, exposed.
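The package description calls an extractor a "completely piped" set of BoilerpipeFilters. The following is a minimal sketch of that idea only. Block, Filter, pipe, MARK_ALL_CONTENT and minWords are hypothetical stand-ins written for this illustration; they are not the boilerpipe classes, and the convention that a filter returns true when it changed something is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration; NOT the boilerpipe classes.
final class PipelineSketch {

    /** A piece of text with a content flag (stand-in for a text block). */
    static final class Block {
        final String text;
        boolean content;

        Block(String text, boolean content) {
            this.text = text;
            this.content = content;
        }

        /** Counts whitespace-separated tokens. */
        int numWords() {
            String t = text.trim();
            return t.isEmpty() ? 0 : t.split("\\s+").length;
        }
    }

    /** One processing step; returns true if it changed any block (assumed convention). */
    interface Filter {
        boolean process(List<Block> blocks);
    }

    /** Chains the given steps into one filter that runs them in order. */
    static Filter pipe(Filter... steps) {
        return blocks -> {
            boolean changed = false;
            for (Filter step : steps) {
                changed |= step.process(blocks); // every step always runs
            }
            return changed;
        };
    }

    /** Marks every block as content. */
    static final Filter MARK_ALL_CONTENT = blocks -> {
        boolean changed = false;
        for (Block b : blocks) {
            if (!b.content) {
                b.content = true;
                changed = true;
            }
        }
        return changed;
    };

    /** Clears the content flag of content blocks with fewer than k words. */
    static Filter minWords(int k) {
        return blocks -> {
            boolean changed = false;
            for (Block b : blocks) {
                if (b.content && b.numWords() < k) {
                    b.content = false;
                    changed = true;
                }
            }
            return changed;
        };
    }

    public static void main(String[] args) {
        List<Block> doc = new ArrayList<>(Arrays.asList(
            new Block("Home | About", false),
            new Block("The quick brown fox jumps over the lazy dog today", false)));
        Filter extractor = pipe(MARK_ALL_CONTENT, minWords(5));
        extractor.process(doc);
        for (Block b : doc) {
            System.out.println(b.content + "\t" + b.text);
        }
    }
}
```

Because the chained result is itself a Filter, a whole pipeline can be passed anywhere a single step is expected, which is one plausible reading of why the extractors appear on this page as BoilerpipeFilter implementations.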
Uses of BoilerpipeFilter in com.kohlschutter.boilerpipe.filters.debug
Classes in com.kohlschutter.boilerpipe.filters.debug that implement BoilerpipeFilter
Modifier and Type: Description
final class: Prints debug information about the current state of the TextDocument.
Uses of BoilerpipeFilter in com.kohlschutter.boilerpipe.filters.english
Classes in com.kohlschutter.boilerpipe.filters.english that implement BoilerpipeFilter
Modifier and Type: Description
class: Classifies TextBlocks as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features", particularly using text densities and link densities.
final class: Marks all blocks as "non-content" that occur after blocks that have been marked DefaultLabels.INDICATES_END_OF_TEXT.
final class: Marks all blocks as "non-content" that occur after blocks that have been marked DefaultLabels.INDICATES_END_OF_TEXT, and after any content block.
final class: Keeps the largest TextBlock only (by the number of words).
final class: Keeps only those content blocks which contain at least k full-text words (measured by HeuristicFilterBase.getNumFullTextWords(TextBlock)).
class: Classifies TextBlocks as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features" (WSDM 2010), particularly using number of words per block and link density per block.
class: Finds blocks that potentially indicate the end of an article text and marks them with DefaultLabels.INDICATES_END_OF_TEXT.
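The rule "mark all blocks as non-content that occur after a block labelled INDICATES_END_OF_TEXT" can be sketched as a single pass over the blocks. This is an illustration under stated assumptions, not the library's code: Block and ignoreAfterEndOfText are hypothetical names, labels are modelled as plain strings, and the sketch chooses to leave the labelled block itself untouched, which the description above does not specify.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins for illustration; NOT the boilerpipe classes.
final class EndOfTextSketch {

    /** Label string used by this sketch only (assumed representation). */
    static final String END_OF_TEXT = "INDICATES_END_OF_TEXT";

    /** A piece of text with a content flag and a set of string labels. */
    static final class Block {
        final String text;
        boolean content;
        final Set<String> labels = new HashSet<>();

        Block(String text, boolean content, String... labels) {
            this.text = text;
            this.content = content;
            this.labels.addAll(Arrays.asList(labels));
        }
    }

    /**
     * Clears the content flag of every block that comes after the first
     * block carrying END_OF_TEXT. Returns true if any flag was changed.
     */
    static boolean ignoreAfterEndOfText(List<Block> blocks) {
        boolean changed = false;
        boolean endSeen = false;
        for (Block b : blocks) {
            if (endSeen && b.content) {
                b.content = false;
                changed = true;
            }
            // Checked last, so the labelled block itself keeps its flag.
            if (b.labels.contains(END_OF_TEXT)) {
                endSeen = true;
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        List<Block> doc = new ArrayList<>(Arrays.asList(
            new Block("Article body paragraph.", true),
            new Block("Comments (12)", true, END_OF_TEXT),
            new Block("First reader comment.", true)));
        ignoreAfterEndOfText(doc);
        for (Block b : doc) {
            System.out.println(b.content + "\t" + b.text);
        }
    }
}
```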
Uses of BoilerpipeFilter in com.kohlschutter.boilerpipe.filters.heuristics
Classes in com.kohlschutter.boilerpipe.filters.heuristics that implement BoilerpipeFilter
Modifier and Type: Description
final class: Adds the labels of the preceding block to the current block, optionally adding a prefix.
class: Tries to find TextBlocks that consist of "article metadata".
final class: Fuses adjacent blocks if their distance (in blocks) does not exceed a certain limit.
final class: Merges two blocks using some heuristics.
final class: Marks TextBlocks which contain parts of the HTML <TITLE> tag, using some heuristics which are quite specific to the news domain.
final class: Marks as "content" all TextBlocks which are between the headline and the part that has already been marked content, if they are marked DefaultLabels.MIGHT_BE_CONTENT.
final class: Keeps the largest TextBlock only (by the number of words).
final class: Fuses adjacent blocks if their labels are equal.
final class: Marks all blocks as content that (1) are on the same tag-level as very likely main content (usually the level of the largest block) and (2) have a significant number of words, currently at least 100.
final class: Marks nested list-item blocks after the end of the main content.
class: Merges two subsequent blocks if their text densities are equal.
final class: Marks trailing headlines (TextBlocks that have the label DefaultLabels.HEADING) as boilerplate.
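To make "fuses adjacent blocks if their distance (in blocks) does not exceed a certain limit" concrete, here is a minimal sketch. It assumes each block remembers the range of original block positions it covers and that distance means the gap between one block's last position and the next block's first position; Block, fuse and maxDistance are hypothetical names, and the library may define distance differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration; NOT the boilerpipe classes.
final class ProximityFusionSketch {

    /** Text plus the range of original block positions it covers. */
    static final class Block {
        final String text;
        final int start; // position of the first original block covered
        final int end;   // position of the last original block covered

        Block(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Returns a new list in which each block is merged into its predecessor
     * when (block.start - predecessor.end) is at most maxDistance.
     * The input list is expected to be sorted by position.
     */
    static List<Block> fuse(List<Block> blocks, int maxDistance) {
        List<Block> out = new ArrayList<>();
        for (Block b : blocks) {
            if (!out.isEmpty()) {
                Block prev = out.get(out.size() - 1);
                if (b.start - prev.end <= maxDistance) {
                    // Replace the predecessor with the fused block.
                    out.set(out.size() - 1,
                        new Block(prev.text + "\n" + b.text, prev.start, b.end));
                    continue;
                }
            }
            out.add(b);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
            new Block("First paragraph.", 0, 0),
            new Block("Second paragraph.", 1, 1),
            new Block("Footer text.", 5, 5));
        for (Block b : fuse(blocks, 1)) {
            System.out.println("[" + b.start + ".." + b.end + "] "
                + b.text.replace("\n", " / "));
        }
    }
}
```

Fusion is transitive in this sketch: a fused block's end position moves forward, so a third block close to the second is merged into the same result.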
Uses of BoilerpipeFilter in com.kohlschutter.boilerpipe.filters.simple
Classes in com.kohlschutter.boilerpipe.filters.simple that implement BoilerpipeFilter
Modifier and Type: Description
final class: Removes TextBlocks which have explicitly been marked as "not content".
final class: Reverts the "isContent" flag for all TextBlocks.
final class: Marks all blocks that contain a given label as "boilerplate".
final class: Marks all blocks that contain a given label as "content".
final class: Marks all blocks as boilerplate.
final class: Marks all blocks as content.
final class: Keeps only blocks that have at least one segment fragment ("clause") with at least k words (default: 5).
final class: Keeps only those content blocks which contain at least k words.
final class: Splits TextBlocks at paragraph boundaries.
class: Marks blocks as "content" if their preceding and following blocks are both already marked "content", and the given TextBlockCondition is met.
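The last entry, which promotes a block to "content" when both neighbours are already content and a condition holds, can be sketched as follows. The condition is modelled here with java.util.function.Predicate as a stand-in for TextBlockCondition; Block and markSurrounded are hypothetical names for this illustration and not the library's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-ins for illustration; NOT the boilerpipe classes.
final class SurroundingSketch {

    /** A piece of text with a content flag. */
    static final class Block {
        final String text;
        boolean content;

        Block(String text, boolean content) {
            this.text = text;
            this.content = content;
        }

        /** Counts whitespace-separated tokens. */
        int numWords() {
            String t = text.trim();
            return t.isEmpty() ? 0 : t.split("\\s+").length;
        }
    }

    /**
     * Marks a non-content block as content if the blocks directly before and
     * after it are content and the condition accepts it.
     * Returns true if any flag was changed.
     */
    static boolean markSurrounded(List<Block> blocks, Predicate<Block> condition) {
        boolean changed = false;
        // The first and last block have only one neighbour, so they are skipped.
        for (int i = 1; i < blocks.size() - 1; i++) {
            Block prev = blocks.get(i - 1);
            Block cur = blocks.get(i);
            Block next = blocks.get(i + 1);
            if (!cur.content && prev.content && next.content && condition.test(cur)) {
                cur.content = true;
                changed = true;
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        List<Block> doc = new ArrayList<>(Arrays.asList(
            new Block("Opening paragraph of the article.", true),
            new Block("A short pull quote here", false),
            new Block("Closing paragraph of the article.", true)));
        // Example condition: only promote blocks with at least three words.
        markSurrounded(doc, b -> b.numWords() >= 3);
        for (Block b : doc) {
            System.out.println(b.content + "\t" + b.text);
        }
    }
}
```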