All Classes
| Class |
Description |
| AddPrecedingLabelsFilter |
Adds the labels of the preceding block to the current block, optionally adding a prefix.
|
| ArticleExtractor |
A full-text extractor which is tuned towards news articles.
|
| ArticleMetadataFilter |
Tries to find TextBlocks that consist of "article metadata".
|
| ArticleSentencesExtractor |
A full-text extractor which is tuned towards extracting sentences from news articles.
|
| BlockProximityFusion |
Fuses adjacent blocks if their distance (in blocks) does not exceed a certain limit.
|
| BoilerpipeDocumentSource |
|
| BoilerpipeExtractor |
Describes a complete filter pipeline.
|
| BoilerpipeFilter |
|
| BoilerpipeHTMLContentHandler |
|
| BoilerpipeHTMLContentHandler.Event |
|
| BoilerpipeHTMLParser |
|
| BoilerpipeInput |
|
| BoilerpipeProcessingException |
Exception for signaling failure in the processing pipeline.
|
| BoilerpipeSAXInput |
|
| BoilerplateBlockFilter |
Removes TextBlocks which have explicitly been marked as "not content".
|
| CanolaExtractor |
|
| CommonExtractors |
|
| CommonTagActions |
Defines an action that is to be performed whenever a particular tag occurs during HTML parsing.
|
| CommonTagActions.BlockTagLabelAction |
|
| CommonTagActions.Chained |
|
| CommonTagActions.InlineTagLabelAction |
|
| ConditionalLabelAction |
Adds labels to a TextBlock if the given criteria are met.
|
| ContentFusion |
Merges two blocks using some heuristics.
|
| DefaultExtractor |
A quite generic full-text extractor.
|
| DefaultLabels |
|
| DefaultTagActionMap |
|
| DensityRulesClassifier |
Classifies TextBlocks as content/not-content through rules that have been determined
using the C4.8 machine learning algorithm, as described in the paper
"Boilerplate Detection using Shallow Text Features", particularly using text densities and link
densities.
|
| DocumentTitleMatchClassifier |
Marks TextBlocks which contain parts of the HTML <TITLE> tag, using
some heuristics which are quite specific to the news domain.
|
| ExpandTitleToContentFilter |
|
| ExtractorBase |
The base class of Extractors.
|
| HeuristicFilterBase |
Base class for some heuristics that are used by boilerpipe filters.
|
| HTMLDocument |
|
| HTMLElements |
Collection of HTML element information.
|
| HTMLElements.Element |
Element information.
|
| HTMLElements.ElementList |
Unsynchronized list of elements.
|
| HTMLFetcher |
A very simple HTTP/HTML fetcher, really just for demo purposes.
|
| HTMLHighlightDemo |
Demonstrates how to use Boilerpipe to get the main content, highlighted as HTML.
|
| HTMLHighlighter |
Highlights text blocks in an HTML document that have been marked as "content" in the
corresponding TextDocument.
|
| HTMLHighlighter.TagAction |
|
| HTMLTagBalancer |
|
| HTMLTagBalancer.ElementEntry |
Structure to hold information about an element placed in the buffer to be consumed later.
|
| HTMLTagBalancer.Info |
Element info for each start element.
|
| HTMLTagBalancer.InfoStack |
Unsynchronized stack of element information.
|
| IgnoreBlocksAfterContentFilter |
|
| IgnoreBlocksAfterContentFromEndFilter |
|
| Image |
Represents an Image resource that is contained in the document.
|
| ImageExtractor |
Extracts the images that are enclosed by extracted content.
|
| ImageExtractor.TagAction |
|
| ImageExtractorDemo |
Demonstrates how to use Boilerpipe to get the images within the main content.
|
| InputSourceable |
An InputSourceable can return an arbitrary number of new InputSources for a given
document.
|
| InvertedFilter |
Inverts the "isContent" flag for all TextBlocks.
|
| KeepEverythingExtractor |
Marks everything as content.
|
| KeepEverythingWithMinKWordsExtractor |
A full-text extractor which marks everything as content and then keeps only blocks with at least k words.
|
| KeepLargestBlockFilter |
Keeps the largest TextBlock only (by the number of words).
|
| KeepLargestFulltextBlockFilter |
Keeps the largest TextBlock only (by the number of words).
|
| LabelAction |
|
| LabelFusion |
Fuses adjacent blocks if their labels are equal.
|
| LabelToBoilerplateFilter |
Marks all blocks that contain a given label as "boilerplate".
|
| LabelToContentFilter |
Marks all blocks that contain a given label as "content".
|
| LargeBlockSameTagLevelToContentFilter |
Marks all blocks as content that (1) are on the same tag-level as very likely main content
(usually the level of the largest block) and (2) have a significant number of words,
currently at least 100.
|
| LargestContentExtractor |
A full-text extractor which extracts the largest text component of a page.
|
| ListAtEndFilter |
Marks nested list-item blocks after the end of the main content.
|
| MarkEverythingBoilerplateFilter |
Marks all blocks as boilerplate.
|
| MarkEverythingContentFilter |
Marks all blocks as content.
|
| MarkupTagAction |
Assigns labels for element CSS classes and ids to the corresponding TextBlock.
|
| MinClauseWordsFilter |
Keeps only blocks that have at least one segment fragment ("clause") with at least k
words (default: 5).
|
| MinFulltextWordsFilter |
|
| MinWordsFilter |
Keeps only those content blocks which contain at least k words.
|
| NumWordsRulesClassifier |
Classifies TextBlocks as content/not-content through rules that have been determined
using the C4.8 machine learning algorithm, as described in the paper
"Boilerplate Detection using Shallow Text Features" (WSDM 2010), particularly using number of
words per block and link density per block.
|
| NumWordsRulesExtractor |
A quite generic full-text extractor solely based upon the number of words per block (the current,
the previous and the next block).
|
| Oneliner |
Demonstrates how to use Boilerpipe to get the main content as plain text.
|
| PrintDebugFilter |
Prints debug information about the current state of the TextDocument.
|
| SimpleBlockFusionProcessor |
Merges two subsequent blocks if their text densities are equal.
|
| SimpleEstimator |
|
| SplitParagraphBlocksFilter |
Splits TextBlocks at paragraph boundaries.
|
| SurroundingToContentFilter |
Marks blocks as "content" if their preceding and following blocks are both already marked
"content", and the given TextBlockCondition is met.
|
| TagAction |
Defines an action that is to be performed whenever a particular tag occurs during HTML parsing.
|
| TagActionMap |
Base class for defining a set of TagActions that are to be used for the HTML parsing
process.
|
| TerminatingBlocksFinder |
|
| TextBlock |
Describes a block of text.
|
| TextBlockCondition |
Evaluates whether a given TextBlock meets a certain condition.
|
| TextDocument |
A text document, consisting of one or more TextBlocks.
|
| TextDocumentStatistics |
|
| TrailingHeadlineToBoilerplateFilter |
|
| UnicodeTokenizer |
Tokenizes text according to Unicode word boundaries and strips off non-word characters.
|
| UsingSAX |
Demonstrates how to use Boilerpipe when working with InputSources.
|
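
Several entries above describe filters that mark TextBlocks as content or boilerplate, and BoilerpipeExtractor is described as a complete filter pipeline. The following is a minimal, self-contained sketch of that idea, not the library's own code. The names PipelineSketch, Block, Filter, runPipeline, and contentText are hypothetical stand-ins invented for illustration, and the assumption that each filter reports whether it changed the document is a choice made for this sketch only. The three filters loosely mirror the summaries given for MarkEverythingContentFilter, MinWordsFilter, and InvertedFilter.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-ins only; these are NOT the library's classes or signatures. */
public class PipelineSketch {

    /** Simplified text block: the text plus a mutable "is content" flag. */
    public static final class Block {
        public final String text;
        public boolean isContent;

        public Block(String text, boolean isContent) {
            this.text = text;
            this.isContent = isContent;
        }

        /** Counts whitespace-separated tokens. */
        public int numWords() {
            String t = text.trim();
            return t.isEmpty() ? 0 : t.split("\\s+").length;
        }
    }

    /** Assumption for this sketch: a filter returns true if it changed any block. */
    public interface Filter {
        boolean process(List<Block> doc);
    }

    /** Marks every block as content. */
    public static final Filter MARK_EVERYTHING_CONTENT = doc -> {
        boolean changed = false;
        for (Block b : doc) {
            if (!b.isContent) {
                b.isContent = true;
                changed = true;
            }
        }
        return changed;
    };

    /** Keeps only those content blocks which contain at least k words. */
    public static Filter minWords(int k) {
        return doc -> {
            boolean changed = false;
            for (Block b : doc) {
                if (b.isContent && b.numWords() < k) {
                    b.isContent = false;
                    changed = true;
                }
            }
            return changed;
        };
    }

    /** Flips the content flag of every block. */
    public static final Filter INVERT = doc -> {
        for (Block b : doc) {
            b.isContent = !b.isContent;
        }
        return !doc.isEmpty();
    };

    /** Runs the filters in order; returns true if any of them changed the document. */
    public static boolean runPipeline(List<Block> doc, Filter... filters) {
        boolean changed = false;
        for (Filter f : filters) {
            changed |= f.process(doc);
        }
        return changed;
    }

    /** Joins the text of all blocks currently flagged as content, one per line. */
    public static String contentText(List<Block> doc) {
        List<String> parts = new ArrayList<>();
        for (Block b : doc) {
            if (b.isContent) {
                parts.add(b.text);
            }
        }
        return String.join("\n", parts);
    }

    public static void main(String[] args) {
        List<Block> doc = new ArrayList<>();
        doc.add(new Block("Home About", false));
        doc.add(new Block("The committee published its final report on Tuesday after months of review", false));
        doc.add(new Block("Copyright notice", false));

        // Mark everything as content, then drop content blocks shorter than 5 words.
        runPipeline(doc, MARK_EVERYTHING_CONTENT, minWords(5));
        System.out.println(contentText(doc));
    }
}
```

In this sketch only the twelve-word middle block remains flagged as content after the pipeline runs. Appending INVERT to the pipeline would instead select the two short blocks, which is one way to inspect what a pipeline discarded.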