Uses of Interface
com.kohlschutter.boilerpipe.BoilerpipeFilter
Packages that use BoilerpipeFilter
Package    Description
com.kohlschutter.boilerpipe    The Boilerpipe top-level package.
com.kohlschutter.boilerpipe.extractors    Some standard extractors (i.e., completely piped BoilerpipeFilters).
com.kohlschutter.boilerpipe.filters.debug
com.kohlschutter.boilerpipe.filters.english    These BoilerpipeFilters have only been tested on English text.
com.kohlschutter.boilerpipe.filters.heuristics    These BoilerpipeFilters are pure heuristics.
com.kohlschutter.boilerpipe.filters.simple    These BoilerpipeFilters are straightforward and probably not really specific to English.
Uses of BoilerpipeFilter in com.kohlschutter.boilerpipe
Subinterfaces of BoilerpipeFilter in com.kohlschutter.boilerpipe
Modifier and Type    Interface    Description
interface    BoilerpipeExtractor    Describes a complete filter pipeline.
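To illustrate what "a complete filter pipeline" can mean, here is a minimal, self-contained sketch. It assumes that a filter processes a list of text blocks and reports whether it changed anything, and that a pipeline is simply a filter that runs other filters in order. All type names below (SketchBlock, SketchFilter, SketchPipeline, SketchStages) are hypothetical stand-ins for illustration; they are not the boilerpipe classes and do not reproduce their signatures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a text block; NOT the library's TextBlock.
final class SketchBlock {
    final String text;
    boolean isContent;

    SketchBlock(String text, boolean isContent) {
        this.text = text;
        this.isContent = isContent;
    }

    // Counts whitespace-separated tokens.
    int numWords() {
        String t = text.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }
}

// Hypothetical filter contract: mutate the blocks, return true if anything changed.
interface SketchFilter {
    boolean process(List<SketchBlock> blocks);
}

// A pipeline is itself a filter that runs its stages in order.
final class SketchPipeline implements SketchFilter {
    private final List<SketchFilter> stages;

    SketchPipeline(SketchFilter... stages) {
        this.stages = Arrays.asList(stages);
    }

    @Override
    public boolean process(List<SketchBlock> blocks) {
        boolean changed = false;
        for (SketchFilter stage : stages) {
            // |= always evaluates the right-hand side, so every stage runs.
            changed |= stage.process(blocks);
        }
        return changed;
    }

    // Joins the text of all blocks still marked as content, one per line.
    static String contentText(List<SketchBlock> blocks) {
        StringBuilder sb = new StringBuilder();
        for (SketchBlock b : blocks) {
            if (b.isContent) {
                if (sb.length() > 0) sb.append('\n');
                sb.append(b.text);
            }
        }
        return sb.toString();
    }
}

// Two illustrative stages (assumed behavior, not the library's implementations).
final class SketchStages {
    static SketchFilter markAllContent() {
        return blocks -> {
            boolean changed = false;
            for (SketchBlock b : blocks) {
                if (!b.isContent) { b.isContent = true; changed = true; }
            }
            return changed;
        };
    }

    static SketchFilter minWords(int k) {
        return blocks -> {
            boolean changed = false;
            for (SketchBlock b : blocks) {
                if (b.isContent && b.numWords() < k) { b.isContent = false; changed = true; }
            }
            return changed;
        };
    }
}

final class PipelineDemo {
    public static void main(String[] args) {
        List<SketchBlock> blocks = new ArrayList<>();
        blocks.add(new SketchBlock("Home About", false));
        blocks.add(new SketchBlock("This is the main article text with many words", false));
        blocks.add(new SketchBlock("Copyright", false));
        new SketchPipeline(SketchStages.markAllContent(), SketchStages.minWords(5)).process(blocks);
        System.out.println(SketchPipeline.contentText(blocks));
    }
}
```

Because the pipeline implements the same contract as its stages, pipelines can be nested inside other pipelines, which is one way to read the relationship between an extractor and the filters listed on this page.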
Uses of BoilerpipeFilter in com.kohlschutter.boilerpipe.extractors
Classes in com.kohlschutter.boilerpipe.extractors that implement BoilerpipeFilter
Modifier and Type    Class    Description
class    ArticleExtractor    A full-text extractor which is tuned towards news articles.
class    ArticleSentencesExtractor    A full-text extractor which is tuned towards extracting sentences from news articles.
class    CanolaExtractor
class    DefaultExtractor    A quite generic full-text extractor.
class    ExtractorBase    The base class of Extractors.
class    KeepEverythingExtractor    Marks everything as content.
class    KeepEverythingWithMinKWordsExtractor    A full-text extractor which extracts the largest text component of a page.
class    LargestContentExtractor    A full-text extractor which extracts the largest text component of a page.
class    NumWordsRulesExtractor    A quite generic full-text extractor solely based upon the number of words per block (the current, the previous and the next block).

Fields in com.kohlschutter.boilerpipe.extractors declared as BoilerpipeFilter
Modifier and Type    Field    Description
static BoilerpipeFilter    CanolaExtractor.CLASSIFIER    The actual classifier, exposed.
Uses of BoilerpipeFilter in com.kohlschutter.boilerpipe.filters.debug
Classes in com.kohlschutter.boilerpipe.filters.debug that implement BoilerpipeFilter
Modifier and Type    Class    Description
class    PrintDebugFilter    Prints debug information about the current state of the TextDocument.
Uses of BoilerpipeFilter in com.kohlschutter.boilerpipe.filters.english
Classes in com.kohlschutter.boilerpipe.filters.english that implement BoilerpipeFilter
Modifier and Type    Class    Description
class    DensityRulesClassifier    Classifies TextBlocks as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features", particularly using text densities and link densities.
class    IgnoreBlocksAfterContentFilter    Marks all blocks as "non-content" that occur after blocks that have been marked DefaultLabels.INDICATES_END_OF_TEXT.
class    IgnoreBlocksAfterContentFromEndFilter    Marks all blocks as "non-content" that occur after blocks that have been marked DefaultLabels.INDICATES_END_OF_TEXT, and after any content block.
class    KeepLargestFulltextBlockFilter    Keeps the largest TextBlock only (by the number of words).
class    MinFulltextWordsFilter    Keeps only those content blocks which contain at least k full-text words (measured by HeuristicFilterBase.getNumFullTextWords(TextBlock)).
class    NumWordsRulesClassifier    Classifies TextBlocks as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features" (WSDM 2010), particularly using number of words per block and link density per block.
class    TerminatingBlocksFinder    Finds blocks which potentially indicate the end of an article text and marks them with DefaultLabels.INDICATES_END_OF_TEXT.
Uses of BoilerpipeFilter in com.kohlschutter.boilerpipe.filters.heuristics
Classes in com.kohlschutter.boilerpipe.filters.heuristics that implement BoilerpipeFilter
Modifier and Type    Class    Description
class    AddPrecedingLabelsFilter    Adds the labels of the preceding block to the current block, optionally adding a prefix.
class    ArticleMetadataFilter    Tries to find TextBlocks that consist of "article metadata".
class    BlockProximityFusion    Fuses adjacent blocks if their distance (in blocks) does not exceed a certain limit.
class    ContentFusion    Merges two blocks using some heuristics.
class    DocumentTitleMatchClassifier    Marks TextBlocks which contain parts of the HTML <TITLE> tag, using some heuristics which are quite specific to the news domain.
class    ExpandTitleToContentFilter    Marks all TextBlocks "content" which are between the headline and the part that has already been marked content, if they are marked DefaultLabels.MIGHT_BE_CONTENT.
class    KeepLargestBlockFilter    Keeps the largest TextBlock only (by the number of words).
class    LabelFusion    Fuses adjacent blocks if their labels are equal.
class    LargeBlockSameTagLevelToContentFilter    Marks all blocks as content that (a) are on the same tag level as very likely main content (usually the level of the largest block) and (b) have a significant number of words, currently at least 100.
class    ListAtEndFilter    Marks nested list-item blocks after the end of the main content.
class    SimpleBlockFusionProcessor    Merges two subsequent blocks if their text densities are equal.
class    TrailingHeadlineToBoilerplateFilter    Marks trailing headlines (TextBlocks that have the label DefaultLabels.HEADING) as boilerplate.
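The fusion filters above share one idea: walk the block list and merge a block into its predecessor when some condition holds. The following sketch shows that idea for the rule "fuse adjacent blocks if their labels are equal". It is an illustration under assumptions, not the library's LabelFusion: LabeledBlock and LabelFusionSketch are hypothetical names, labels are modeled as plain strings, merged text is joined with a newline, and any extra conditions the real class may apply are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a labeled text block; NOT the library's TextBlock.
final class LabeledBlock {
    final String text;
    final Set<String> labels;

    LabeledBlock(String text, Set<String> labels) {
        this.text = text;
        this.labels = labels;
    }
}

final class LabelFusionSketch {
    // Returns a new list in which each run of adjacent blocks with equal
    // label sets has been merged into a single block.
    static List<LabeledBlock> fuseEqualLabels(List<LabeledBlock> blocks) {
        List<LabeledBlock> out = new ArrayList<>();
        for (LabeledBlock b : blocks) {
            if (!out.isEmpty()) {
                LabeledBlock prev = out.get(out.size() - 1);
                if (prev.labels.equals(b.labels)) {
                    // Merge into the previous block instead of appending.
                    out.set(out.size() - 1,
                            new LabeledBlock(prev.text + "\n" + b.text, prev.labels));
                    continue;
                }
            }
            out.add(b);
        }
        return out;
    }
}
```

Comparing against the last emitted block (rather than the previous input block) lets a run of three or more equally labeled blocks collapse into one in a single pass.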
Uses of BoilerpipeFilter in com.kohlschutter.boilerpipe.filters.simple
Classes in com.kohlschutter.boilerpipe.filters.simple that implement BoilerpipeFilter
Modifier and Type    Class    Description
class    BoilerplateBlockFilter    Removes TextBlocks which have explicitly been marked as "not content".
class    InvertedFilter    Inverts the "isContent" flag for all TextBlocks.
class    LabelToBoilerplateFilter    Marks all blocks that contain a given label as "boilerplate".
class    LabelToContentFilter    Marks all blocks that contain a given label as "content".
class    MarkEverythingBoilerplateFilter    Marks all blocks as boilerplate.
class    MarkEverythingContentFilter    Marks all blocks as content.
class    MinClauseWordsFilter    Keeps only blocks that have at least one segment fragment ("clause") with at least k words (default: 5).
class    MinWordsFilter    Keeps only those content blocks which contain at least k words.
class    SplitParagraphBlocksFilter    Splits TextBlocks at paragraph boundaries.
class    SurroundingToContentFilter    Marks blocks as "content" if their preceding and following blocks are both already marked "content", and the given TextBlockCondition is met.
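As a rough illustration of the last rule in the table (mark a block as content when both neighbors are content and a condition holds), here is a small sketch. It is not the library's SurroundingToContentFilter: FlagBlock and SurroundingSketch are hypothetical names, and a java.util.function.Predicate stands in for the TextBlockCondition mentioned above. The first and last blocks are never changed in this sketch because they lack one neighbor; whether the real class treats the edges the same way is an assumption.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for a text block; NOT the library's TextBlock.
final class FlagBlock {
    final String text;
    boolean isContent;

    FlagBlock(String text, boolean isContent) {
        this.text = text;
        this.isContent = isContent;
    }
}

final class SurroundingSketch {
    // Marks a non-content block as content when the blocks directly before
    // and after it are content and the condition accepts it.
    // Returns true if at least one block was changed.
    static boolean markSurrounded(List<FlagBlock> blocks, Predicate<FlagBlock> condition) {
        boolean changed = false;
        for (int i = 1; i < blocks.size() - 1; i++) {
            FlagBlock cur = blocks.get(i);
            if (!cur.isContent
                    && blocks.get(i - 1).isContent
                    && blocks.get(i + 1).isContent
                    && condition.test(cur)) {
                cur.isContent = true;
                changed = true;
            }
        }
        return changed;
    }
}
```

A call such as `SurroundingSketch.markSurrounded(blocks, b -> b.text.length() > 5)` would promote a short pull quote sitting between two content paragraphs while leaving a trailing footer untouched.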