All Classes and Interfaces

Class
Description
Adds the labels of the preceding block to the current block, optionally adding a prefix.
A full-text extractor which is tuned towards news articles.
Tries to find TextBlocks that comprise "article metadata".
A full-text extractor which is tuned towards extracting sentences from news articles.
Fuses adjacent blocks if their distance (in blocks) does not exceed a certain limit.
Something that can be represented as a TextDocument.
Describes a complete filter pipeline.
A generic BoilerpipeFilter.
A simple SAX ContentHandler, used by BoilerpipeSAXInput.
 
A simple SAX Parser, used by BoilerpipeSAXInput.
A source that returns TextDocuments.
Exception for signaling failure in the processing pipeline.
Parses an InputSource using SAX and returns a TextDocument.
Removes TextBlocks which have explicitly been marked as "not content".
A full-text extractor trained on krdwrd Canola.
Provides quick access to common BoilerpipeExtractors.
Defines an action that is to be performed whenever a particular tag occurs during HTML parsing.
CommonTagActions for block-level elements, which trigger some LabelAction on the generated TextBlock.
 
CommonTagActions for inline elements, which trigger some LabelAction on the generated TextBlock.
Adds labels to a TextBlock if the given criteria are met.
Merges two blocks using some heuristics.
A quite generic full-text extractor.
Some pre-defined labels which can be used in conjunction with TextBlock.addLabel(String) and TextBlock.hasLabel(String).
Default TagActions.
Classifies TextBlocks as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features", particularly using text densities and link densities.
Marks TextBlocks which contain parts of the HTML <TITLE> tag, using some heuristics which are quite specific to the news domain.
Marks as "content" all TextBlocks that lie between the headline and the part that has already been marked as content, if they are labeled DefaultLabels.MIGHT_BE_CONTENT.
The base class of Extractors.
Base class for some heuristics that are used by boilerpipe filters.
Collection of HTML element information.
Element information.
Unsynchronized list of elements.
A very simple HTTP/HTML fetcher, really just for demo purposes.
Demonstrates how to use Boilerpipe to get the main content, highlighted as HTML.
Highlights text blocks in an HTML document that have been marked as "content" in the corresponding TextDocument.
 
 
Structure to hold information about an element placed in a buffer to be consumed later.
Element info for each start element.
Unsynchronized stack of element information.
Marks all blocks as "non-content" that occur after blocks that have been marked DefaultLabels.INDICATES_END_OF_TEXT.
Marks all blocks as "non-content" that occur after blocks that have been marked DefaultLabels.INDICATES_END_OF_TEXT, and after any content block.
Represents an Image resource that is contained in the document.
Extracts the images that are enclosed by extracted content.
 
Demonstrates how to use Boilerpipe to get the images within the main content.
An InputSourceable can return an arbitrary number of new InputSources for a given document.
Inverts the "isContent" flag for all TextBlocks.
Marks everything as content.
A full-text extractor which extracts the largest text component of a page.
Keeps the largest TextBlock only (by the number of words).
Keeps the largest TextBlock only (by the number of words).
Helps with adding labels to TextBlocks.
Fuses adjacent blocks if their labels are equal.
Marks all blocks that contain a given label as "boilerplate".
Marks all blocks that contain a given label as "content".
Marks all blocks as content that are on the same tag level as the very likely main content (usually the level of the largest block) and have a significant number of words (currently at least 100).
A full-text extractor which extracts the largest text component of a page.
Marks nested list-item blocks after the end of the main content.
Marks all blocks as boilerplate.
Marks all blocks as content.
Assigns labels for element CSS classes and ids to the corresponding TextBlock.
Keeps only blocks that have at least one segment fragment ("clause") with at least k words (default: 5).
Keeps only those content blocks which contain at least k full-text words (measured by HeuristicFilterBase.getNumFullTextWords(TextBlock)).
Keeps only those content blocks which contain at least k words.
Classifies TextBlocks as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features" (WSDM 2010), particularly using number of words per block and link density per block.
A quite generic full-text extractor solely based upon the number of words per block (the current, the previous and the next block).
Demonstrates how to use Boilerpipe to get the main content as plain text.
Prints debug information about the current state of the TextDocument.
Merges two subsequent blocks if their text densities are equal.
Estimates the "goodness" of a BoilerpipeExtractor on a given document.
Splits TextBlocks at paragraph boundaries.
Marks blocks as "content" if their preceding and following blocks are both already marked "content", and the given TextBlockCondition is met.
Defines an action that is to be performed whenever a particular tag occurs during HTML parsing.
Base class for defining a set of TagActions that are to be used for the HTML parsing process.
Finds blocks which are potentially indicating the end of an article text and marks them with DefaultLabels.INDICATES_END_OF_TEXT.
Describes a block of text.
Evaluates whether a given TextBlock meets a certain condition.
A text document, consisting of one or more TextBlocks.
Provides shallow statistics on a given TextDocument.
Marks trailing headlines (TextBlocks that have the label DefaultLabels.HEADING) as boilerplate.
Tokenizes text according to Unicode word boundaries and strips off non-word characters.
Demonstrates how to use Boilerpipe when working with InputSources.
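Several entries above describe filters that flip a block's "content" flag and are chained into a pipeline (for example, keeping only blocks with at least k words, or keeping only the largest block). The following is a minimal, self-contained sketch of that idea, not the library's own code. The names `Block`, `Filter`, `minWords`, `keepLargest`, and `runPipeline` are hypothetical stand-ins invented for illustration, and the assumption that a filter reports whether it changed anything is also only part of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BlockFilterSketch {

    /** Hypothetical stand-in for a block of text; not the library's TextBlock. */
    static final class Block {
        final String text;
        boolean isContent;

        Block(String text, boolean isContent) {
            this.text = text;
            this.isContent = isContent;
        }

        /** Counts whitespace-separated words in the block. */
        int numWords() {
            String trimmed = text.trim();
            return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        }
    }

    /** Hypothetical filter contract: returns true if any block was changed. */
    interface Filter {
        boolean process(List<Block> blocks);
    }

    /** Marks content blocks with fewer than k words as non-content. */
    static Filter minWords(int k) {
        return blocks -> {
            boolean changed = false;
            for (Block b : blocks) {
                if (b.isContent && b.numWords() < k) {
                    b.isContent = false;
                    changed = true;
                }
            }
            return changed;
        };
    }

    /** Keeps only the largest content block (by word count); the first one wins a tie. */
    static Filter keepLargest() {
        return blocks -> {
            Block largest = null;
            for (Block b : blocks) {
                if (b.isContent && (largest == null || b.numWords() > largest.numWords())) {
                    largest = b;
                }
            }
            boolean changed = false;
            for (Block b : blocks) {
                if (b.isContent && b != largest) {
                    b.isContent = false;
                    changed = true;
                }
            }
            return changed;
        };
    }

    /** Runs the filters in order; returns true if at least one of them changed a block. */
    static boolean runPipeline(List<Block> blocks, List<Filter> filters) {
        boolean changed = false;
        for (Filter f : filters) {
            changed |= f.process(blocks);
        }
        return changed;
    }

    public static void main(String[] args) {
        List<Block> blocks = new ArrayList<>(Arrays.asList(
                new Block("Home About Contact", true),
                new Block("The quick brown fox jumps over the lazy dog near the river", true),
                new Block("A shorter paragraph with six words", true)));

        runPipeline(blocks, Arrays.asList(minWords(5), keepLargest()));

        // Print whatever is still marked as content after both filters ran.
        for (Block b : blocks) {
            if (b.isContent) {
                System.out.println(b.text);
            }
        }
    }
}
```

In this sketch the three-word navigation line is dropped by `minWords(5)`, and `keepLargest()` then leaves only the twelve-word sentence marked as content. Modeling each filter as a small object with a single `process` method keeps the steps independently testable and lets them be reordered freely.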