Class ExtractorBase
java.lang.Object
com.kohlschutter.boilerpipe.extractors.ExtractorBase
- All Implemented Interfaces:
BoilerpipeExtractor, BoilerpipeFilter
- Direct Known Subclasses:
ArticleExtractor, ArticleSentencesExtractor, CanolaExtractor, DefaultExtractor, KeepEverythingExtractor, KeepEverythingWithMinKWordsExtractor, LargestContentExtractor, NumWordsRulesExtractor
The base class of Extractors. Also provides some helper methods to quickly retrieve the text that
remained after processing.
Constructor Summary
Constructors
ExtractorBase()
Method Summary
Modifier and Type, Method, Description
String getText(TextDocument doc): Extracts text from the given TextDocument object.
String getText(Reader r): Extracts text from the HTML code available from the given Reader.
String getText(String html): Extracts text from the HTML code given as a String.
String getText(URL url): Extracts text from the HTML code available from the given URL.
String getText(InputSource is): Extracts text from the HTML code available from the given InputSource.
Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface BoilerpipeFilter
process
Constructor Details
ExtractorBase
public ExtractorBase()
Method Details
getText
public String getText(String html) throws BoilerpipeProcessingException
Extracts text from the HTML code given as a String.
- Specified by:
getText in interface BoilerpipeExtractor
- Parameters:
html - The HTML code as a String.
- Returns:
The extracted text.
- Throws:
BoilerpipeProcessingException
getText
public String getText(InputSource is) throws BoilerpipeProcessingException
Extracts text from the HTML code available from the given InputSource.
- Specified by:
getText in interface BoilerpipeExtractor
- Parameters:
is - The InputSource containing the HTML.
- Returns:
The extracted text.
- Throws:
BoilerpipeProcessingException
getText
public String getText(URL url) throws BoilerpipeProcessingException
Extracts text from the HTML code available from the given URL. NOTE: This method is intended mainly for demonstration purposes. If you are going to crawl the Web, consider using getText(InputSource) instead.
- Parameters:
url - The URL pointing to the HTML code.
- Returns:
The extracted text.
- Throws:
BoilerpipeProcessingException
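The note above recommends getText(InputSource) for crawling. A minimal sketch of what that could look like, assuming the page body has already been downloaded by the caller's own HTTP client: the class name CrawlInputSketch, the helper toInputSource, and the charsetName parameter are hypothetical names used only for illustration, and the extractor call is shown as a comment because it needs boilerpipe on the classpath (it assumes a subclass such as ArticleExtractor exposes a shared INSTANCE).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of boilerpipe.
final class CrawlInputSketch {

    /**
     * Wraps an already-downloaded response body in an InputSource.
     * charsetName is whatever charset the caller's HTTP client reported;
     * pass null to leave the encoding unset.
     */
    static InputSource toInputSource(byte[] body, String charsetName) {
        InputSource is = new InputSource(new ByteArrayInputStream(body));
        if (charsetName != null) {
            is.setEncoding(charsetName); // recorded on the InputSource as an encoding hint
        }
        return is;
    }

    public static void main(String[] args) {
        byte[] body = "<html><body><p>Hello</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        InputSource is = toInputSource(body, "UTF-8");

        // With boilerpipe available, the InputSource would then be handed to an
        // extractor (assumed usage, not compiled here):
        // String text = ArticleExtractor.INSTANCE.getText(is);
        System.out.println(is.getEncoding());
    }
}
```

Fetching the bytes separately keeps timeouts, redirects, and retry policy under the crawler's control, which is presumably why the URL overload is described as a demonstration helper.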
getText
public String getText(Reader r) throws BoilerpipeProcessingException
Extracts text from the HTML code available from the given Reader.
- Specified by:
getText in interface BoilerpipeExtractor
- Parameters:
r - The Reader containing the HTML.
- Returns:
The extracted text.
- Throws:
BoilerpipeProcessingException
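As an illustration of the Reader overload, the sketch below opens a saved HTML file with an explicit charset so that decoding happens before the text reaches the extractor. The class name ReaderInputSketch and the helper openUtf8 are hypothetical, the file is assumed to be UTF-8, and the extractor call appears only as a comment since it requires boilerpipe on the classpath.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of boilerpipe.
final class ReaderInputSketch {

    /** Opens an HTML file as a buffered Reader, decoding it as UTF-8 (an assumption). */
    static Reader openUtf8(Path file) throws IOException {
        return Files.newBufferedReader(file, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Create a small temporary page so the sketch runs on its own.
        Path page = Files.createTempFile("page", ".html");
        Files.write(page, "<html><body><p>Hello</p></body></html>"
                .getBytes(StandardCharsets.UTF_8));

        // try-with-resources closes the Reader once extraction is done.
        try (Reader r = openUtf8(page)) {
            // With boilerpipe available (assumed usage, not compiled here):
            // String text = ArticleExtractor.INSTANCE.getText(r);
            System.out.println((char) r.read());
        } finally {
            Files.delete(page);
        }
    }
}
```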
getText
public String getText(TextDocument doc) throws BoilerpipeProcessingException
Extracts text from the given TextDocument object.
- Specified by:
getText in interface BoilerpipeExtractor
- Parameters:
doc - The TextDocument.
- Returns:
The extracted text.
- Throws:
BoilerpipeProcessingException