Package com.kohlschutter.boilerpipe.sax
Class BoilerpipeHTMLParser
java.lang.Object
org.apache.xerces.parsers.XMLParser
org.apache.xerces.parsers.AbstractXMLDocumentParser
org.apache.xerces.parsers.AbstractSAXParser
com.kohlschutter.boilerpipe.sax.BoilerpipeHTMLParser
All Implemented Interfaces:
BoilerpipeDocumentSource, org.apache.xerces.xni.XMLDocumentHandler, org.apache.xerces.xni.XMLDTDContentModelHandler, org.apache.xerces.xni.XMLDTDHandler, org.apache.xerces.xs.PSVIProvider, Parser, XMLReader
public class BoilerpipeHTMLParser
extends org.apache.xerces.parsers.AbstractSAXParser
implements BoilerpipeDocumentSource
A simple SAX parser, used by BoilerpipeSAXInput. The parser uses CyberNeko to parse HTML content.
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.xerces.parsers.AbstractSAXParser
org.apache.xerces.parsers.AbstractSAXParser.AttributesProxy, org.apache.xerces.parsers.AbstractSAXParser.LocatorProxy
Field Summary
Fields
contentHandler
Fields inherited from class org.apache.xerces.parsers.AbstractSAXParser
ALLOW_UE_AND_NOTATION_EVENTS, DECLARATION_HANDLER, DOM_NODE, fContentHandler, fDeclaredAttrs, fDeclHandler, fDocumentHandler, fDTDHandler, fLexicalHandler, fLexicalHandlerParameterEntities, fNamespaceContext, fNamespacePrefixes, fNamespaces, fParseInProgress, fQName, fResolveDTDURIs, fStandalone, fUseEntityResolver2, fVersion, fXMLNSURIs, LEXICAL_HANDLER, NAMESPACES, STRING_INTERNING
Fields inherited from class org.apache.xerces.parsers.AbstractXMLDocumentParser
fDocumentSource, fDTDContentModelSource, fDTDSource, fInDTD
Fields inherited from class org.apache.xerces.parsers.XMLParser
ENTITY_RESOLVER, ERROR_HANDLER, fConfiguration
Fields inherited from interface org.apache.xerces.xni.XMLDTDContentModelHandler
OCCURS_ONE_OR_MORE, OCCURS_ZERO_OR_MORE, OCCURS_ZERO_OR_ONE, SEPARATOR_CHOICE, SEPARATOR_SEQUENCE
Fields inherited from interface org.apache.xerces.xni.XMLDTDHandler
CONDITIONAL_IGNORE, CONDITIONAL_INCLUDE
Constructor Summary
Constructors
BoilerpipeHTMLParser(): Constructs a BoilerpipeHTMLParser using a default HTML content handler.
protected BoilerpipeHTMLParser(boolean ignore)
BoilerpipeHTMLParser(BoilerpipeHTMLContentHandler contentHandler): Constructs a BoilerpipeHTMLParser using the given BoilerpipeHTMLContentHandler.
Method Summary
Methods
void setContentHandler(BoilerpipeHTMLContentHandler contentHandler)
void setContentHandler(ContentHandler contentHandler)
TextDocument toTextDocument(): Returns a TextDocument containing the extracted TextBlocks.
Methods inherited from class org.apache.xerces.parsers.AbstractSAXParser
attributeDecl, characters, comment, doctypeDecl, elementDecl, endCDATA, endDocument, endDTD, endElement, endExternalSubset, endGeneralEntity, endNamespaceMapping, endParameterEntity, externalEntityDecl, getAttributePSVI, getAttributePSVIByName, getContentHandler, getDeclHandler, getDTDHandler, getElementPSVI, getEntityResolver, getErrorHandler, getFeature, getLexicalHandler, getProperty, ignorableWhitespace, internalEntityDecl, notationDecl, parse, parse, processingInstruction, reset, setDeclHandler, setDocumentHandler, setDTDHandler, setEntityResolver, setErrorHandler, setFeature, setLexicalHandler, setLocale, setProperty, startCDATA, startDocument, startElement, startExternalSubset, startGeneralEntity, startNamespaceMapping, startParameterEntity, unparsedEntityDecl, xmlDecl
Methods inherited from class org.apache.xerces.parsers.AbstractXMLDocumentParser
any, element, empty, emptyElement, endAttlist, endConditional, endContentModel, endGroup, getDocumentSource, getDTDContentModelSource, getDTDSource, ignoredCharacters, occurrence, pcdata, separator, setDocumentSource, setDTDContentModelSource, setDTDSource, startAttlist, startConditional, startContentModel, startDTD, startGroup, textDecl
Methods inherited from class org.apache.xerces.parsers.XMLParser
parse
Field Details
contentHandler
Constructor Details
BoilerpipeHTMLParser
public BoilerpipeHTMLParser()
Constructs a BoilerpipeHTMLParser using a default HTML content handler.
BoilerpipeHTMLParser
public BoilerpipeHTMLParser(BoilerpipeHTMLContentHandler contentHandler)
Constructs a BoilerpipeHTMLParser using the given BoilerpipeHTMLContentHandler.
Parameters:
contentHandler - the BoilerpipeHTMLContentHandler to use
BoilerpipeHTMLParser
protected BoilerpipeHTMLParser(boolean ignore)
Method Details
setContentHandler
public void setContentHandler(BoilerpipeHTMLContentHandler contentHandler)
setContentHandler
public void setContentHandler(ContentHandler contentHandler)
Specified by:
setContentHandler in interface XMLReader
Overrides:
setContentHandler in class org.apache.xerces.parsers.AbstractSAXParser
toTextDocument
Returns a TextDocument containing the extracted TextBlocks. NOTE: Only call this after AbstractSAXParser.parse(org.xml.sax.InputSource); the blocks are collected while the document is being parsed.
Specified by:
toTextDocument in interface BoilerpipeDocumentSource
Returns:
The TextDocument
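The "parse first, then ask for the result" ordering noted for toTextDocument can be illustrated with a minimal sketch. It deliberately does not use BoilerpipeHTMLParser or CyberNeko: it substitutes the JDK's built-in SAX parser so that it runs without extra dependencies, which also means the input must be well-formed XML rather than arbitrary HTML. The names ParseThenCollect, BlockCollector, extractBlocks, and toBlocks are hypothetical and are not part of boilerpipe.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Illustrative only: a SAX reader whose handler accumulates text blocks during parse(). */
final class ParseThenCollect {

    /** Hypothetical stand-in for a content handler that gathers text between tags. */
    static final class BlockCollector extends DefaultHandler {
        private final List<String> blocks = new ArrayList<>();
        private final StringBuilder current = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            flush(); // any tag boundary closes the block collected so far
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            flush();
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            current.append(ch, start, length); // may be called several times per text run
        }

        private void flush() {
            String text = current.toString().trim();
            if (!text.isEmpty()) {
                blocks.add(text);
            }
            current.setLength(0);
        }

        /** Only meaningful after parsing has finished; before that the list is empty. */
        List<String> toBlocks() {
            return new ArrayList<>(blocks);
        }
    }

    /** Parses well-formed markup and returns the collected text blocks in document order. */
    static List<String> extractBlocks(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            BlockCollector handler = new BlockCollector();
            reader.setContentHandler(handler);                    // 1. register the handler
            reader.parse(new InputSource(new StringReader(xml))); // 2. parse
            return handler.toBlocks();                            // 3. only now read the result
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        String doc = "<html><body><h1>Title</h1><p>First paragraph.</p></body></html>";
        System.out.println(extractBlocks(doc));
    }
}
```

With BoilerpipeHTMLParser the same ordering would presumably apply: construct the parser, call parse with an InputSource, then call toTextDocument.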