All Classes and Interfaces
Class
Description
Adds the labels of the preceding block to the current block, optionally adding a prefix.
A full-text extractor which is tuned towards news articles.
Tries to find TextBlocks that consist of "article metadata".
A full-text extractor which is tuned towards extracting sentences from news articles.
Fuses adjacent blocks if their distance (in blocks) does not exceed a certain limit.
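The proximity-based fusion described above can be illustrated with a minimal, self-contained sketch. It does not use the library's own classes: `Block`, `fuse`, and `maxDistance` are hypothetical names introduced here, and the input is assumed to hold only the blocks of interest together with their original positions.

```java
import java.util.ArrayList;
import java.util.List;

final class ProximityFusionSketch {

    /** Hypothetical stand-in for a text block: its position in the document and its text. */
    static final class Block {
        final int index; // position of the block in the original block sequence
        String text;

        Block(int index, String text) {
            this.index = index;
            this.text = text;
        }
    }

    /**
     * Appends a block to the previous output block when the gap between the
     * two input positions is at most maxDistance; otherwise starts a new block.
     */
    static List<Block> fuse(List<Block> blocks, int maxDistance) {
        List<Block> result = new ArrayList<>();
        Block current = null;
        int lastIndex = 0;
        for (Block b : blocks) {
            if (current != null && b.index - lastIndex <= maxDistance) {
                current.text = current.text + "\n" + b.text;
            } else {
                // Copy the block so the caller's input objects are never modified.
                current = new Block(b.index, b.text);
                result.add(current);
            }
            lastIndex = b.index;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Block> in = List.of(new Block(0, "a"), new Block(1, "b"), new Block(5, "c"));
        for (Block b : fuse(in, 1)) {
            System.out.println(b.index + ": " + b.text.replace("\n", " | "));
        }
    }
}
```

With a limit of 1, the blocks at positions 0 and 1 are fused, while the block at position 5 stays separate.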
Something that can be represented as a TextDocument.
Describes a complete filter pipeline.
A generic BoilerpipeFilter.
A simple SAX ContentHandler, used by BoilerpipeSAXInput.
A simple SAX Parser, used by BoilerpipeSAXInput.
A source that returns TextDocuments.
Exception for signaling failure in the processing pipeline.
Parses an InputSource using SAX and returns a TextDocument.
Removes TextBlocks which have explicitly been marked as "not content".
Provides quick access to common BoilerpipeExtractors.
Defines an action that is to be performed whenever a particular tag occurs during HTML parsing.
CommonTagActions for block-level elements, which triggers some LabelAction on the generated TextBlock.
Adds labels to a TextBlock if the given criteria are met.
Merges two blocks using some heuristics.
A quite generic full-text extractor.
Some pre-defined labels which can be used in conjunction with TextBlock.addLabel(String) and TextBlock.hasLabel(String).
Default TagActions.
Classifies TextBlocks as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features", particularly using text densities and link densities.
Marks TextBlocks which contain parts of the HTML <TITLE> tag, using some heuristics which are quite specific to the news domain.
Marks all TextBlocks "content" which are between the headline and the part that has already been marked content, if they are marked DefaultLabels.MIGHT_BE_CONTENT.
The base class of Extractors.
Base class for some heuristics that are used by boilerpipe filters.
An InputSourceable for HTMLFetcher.
Collection of HTML element information.
Element information.
Unsynchronized list of elements.
A very simple HTTP/HTML fetcher, really just for demo purposes.
Demonstrates how to use Boilerpipe to get the main content, highlighted as HTML.
Highlights text blocks in an HTML document that have been marked as "content" in the corresponding TextDocument.
Structure to hold information about an element placed in a buffer to be consumed later.
Element info for each start element.
Unsynchronized stack of element information.
Marks all blocks as "non-content" that occur after blocks that have been marked DefaultLabels.INDICATES_END_OF_TEXT.
Marks all blocks as "non-content" that occur after blocks that have been marked DefaultLabels.INDICATES_END_OF_TEXT, and after any content block.
Represents an Image resource that is contained in the document.
Extracts the images that are enclosed by extracted content.
Demonstrates how to use Boilerpipe to get the images within the main content.
An InputSourceable can return an arbitrary number of new InputSources for a given document.
Reverts the "isContent" flag for all TextBlocks.
Marks everything as content.
A full-text extractor which extracts the largest text component of a page.
Keeps the largest TextBlock only (by the number of words).
Keeps the largest TextBlock only (by the number of words).
Helps with adding labels to TextBlocks.
Fuses adjacent blocks if their labels are equal.
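As an illustration of the "keeps the largest TextBlock only (by the number of words)" entries above, here is a minimal sketch of the selection step. It works on plain strings rather than on the library's TextBlock type; `countWords` and `indexOfLargest` are hypothetical helpers, and counting whitespace-separated tokens is an assumption about what "number of words" means.

```java
import java.util.List;

final class LargestBlockSketch {

    /** Counts whitespace-separated tokens; an empty or blank string has zero words. */
    static int countWords(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /**
     * Returns the index of the block with the most words, or -1 for an empty list.
     * On ties, the earliest block wins.
     */
    static int indexOfLargest(List<String> blocks) {
        int best = -1;
        int bestWords = -1;
        for (int i = 0; i < blocks.size(); i++) {
            int words = countWords(blocks.get(i));
            if (words > bestWords) {
                bestWords = words;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> blocks = List.of("Home About Contact", "The main article text goes here", "Copyright");
        System.out.println(indexOfLargest(blocks));
    }
}
```

A filter built on this idea would keep the block at the returned index and treat every other block as boilerplate.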
Marks all blocks that contain a given label as "boilerplate".
Marks all blocks that contain a given label as "content".
Marks all blocks as content that are on the same tag level as the very likely main content (usually the level of the largest block) and have a significant number of words (currently at least 100).
A full-text extractor which extracts the largest text component of a page.
Marks nested list-item blocks after the end of the main content.
Marks all blocks as boilerplate.
Marks all blocks as content.
Assigns labels for element CSS classes and ids to the corresponding TextBlock.
Keeps only blocks that have at least one segment fragment ("clause") with at least k words (default: 5).
Keeps only those content blocks which contain at least k full-text words (measured by HeuristicFilterBase.getNumFullTextWords(TextBlock)).
Keeps only those content blocks which contain at least k words.
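The minimum-word-count idea behind the entries above can be sketched in a few lines. This is a simplified illustration on plain strings, not the library's implementation: `keepWithMinWords` and `countWords` are hypothetical names, and the sketch drops short blocks from a list, whereas a filter working on a TextDocument would more plausibly just flag them as non-content.

```java
import java.util.ArrayList;
import java.util.List;

final class MinWordsSketch {

    /** Counts whitespace-separated tokens; an empty or blank string has zero words. */
    static int countWords(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Returns, in their original order, the blocks that contain at least k words. */
    static List<String> keepWithMinWords(List<String> blocks, int k) {
        List<String> kept = new ArrayList<>();
        for (String block : blocks) {
            if (countWords(block) >= k) {
                kept.add(block);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> blocks = List.of("Home | About", "This paragraph has five words", "");
        System.out.println(keepWithMinWords(blocks, 5));
    }
}
```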
Classifies TextBlocks as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features" (WSDM 2010), particularly using number of words per block and link density per block.
A quite generic full-text extractor solely based upon the number of words per block (the current, the previous and the next block).
Demonstrates how to use Boilerpipe to get the main content as plain text.
Prints debug information about the current state of the TextDocument.
Merges two subsequent blocks if their text densities are equal.
Estimates the "goodness" of a BoilerpipeExtractor on a given document.
Splits TextBlocks at paragraph boundaries.
Marks blocks as "content" if their preceding and following blocks are both already marked "content", and the given TextBlockCondition is met.
Defines an action that is to be performed whenever a particular tag occurs during HTML parsing.
Base class for defining a set of TagActions that are to be used for the HTML parsing process.
Finds blocks which potentially indicate the end of an article text and marks them with DefaultLabels.INDICATES_END_OF_TEXT.
Describes a block of text.
Evaluates whether a given TextBlock meets a certain condition.
A text document, consisting of one or more TextBlocks.
Provides shallow statistics on a given TextDocument.
Marks trailing headlines (TextBlocks that have the label DefaultLabels.HEADING) as boilerplate.
Tokenizes text according to Unicode word boundaries and strips off non-word characters.
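One way to tokenize at Unicode word boundaries while discarding non-word pieces is the JDK's `java.text.BreakIterator`. The sketch below is an assumption about how such a tokenizer could be written, not the library's own code; `WordTokenizerSketch` and `tokenize` are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordTokenizerSketch {

    /**
     * Splits text at word boundaries and keeps only the pieces that start with
     * a letter or digit, which drops whitespace and punctuation segments.
     */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (Character.isLetterOrDigit(piece.codePointAt(0))) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world! 42"));
    }
}
```

`BreakIterator` reports every boundary, including those around spaces and punctuation, so the filter on the first code point is what removes the non-word segments.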
Demonstrates how to use Boilerpipe when working with InputSources.