Class BoilerpipeHTMLParser

java.lang.Object
org.apache.xerces.parsers.XMLParser
org.apache.xerces.parsers.AbstractXMLDocumentParser
org.apache.xerces.parsers.AbstractSAXParser
com.kohlschutter.boilerpipe.sax.BoilerpipeHTMLParser
All Implemented Interfaces:
BoilerpipeDocumentSource, org.apache.xerces.xni.XMLDocumentHandler, org.apache.xerces.xni.XMLDTDContentModelHandler, org.apache.xerces.xni.XMLDTDHandler, org.apache.xerces.xs.PSVIProvider, Parser, XMLReader

public class BoilerpipeHTMLParser extends org.apache.xerces.parsers.AbstractSAXParser implements BoilerpipeDocumentSource
A simple SAX parser used by BoilerpipeSAXInput. It uses CyberNeko to parse HTML content.
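The SAX pattern this class relies on can be sketched with JDK classes only. This is not boilerpipe's implementation: TextBlockSketch, BlockCollector and extractBlocks are hypothetical names introduced for illustration, and the JDK's XML parser stands in for CyberNeko, so the input must be well-formed XHTML rather than arbitrary HTML. The returned list loosely plays the role of the TextDocument that the parser hands back after a parse.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch only; all names here are hypothetical, not boilerpipe API.
public class TextBlockSketch {

    /** Collects the character data found between element boundaries as text blocks. */
    static final class BlockCollector extends DefaultHandler {
        private final List<String> blocks = new ArrayList<>();
        private final StringBuilder current = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            // SAX may deliver one text run in several chunks, so buffer them.
            current.append(ch, start, length);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            flush();
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            flush();
        }

        // Simplification: every tag boundary closes a block, inline tags included.
        private void flush() {
            String text = current.toString().trim();
            if (!text.isEmpty()) {
                blocks.add(text);
            }
            current.setLength(0);
        }

        List<String> blocks() {
            return blocks;
        }
    }

    /** Parses well-formed XHTML and returns the collected text blocks in document order. */
    static List<String> extractBlocks(String xhtml) {
        BlockCollector handler = new BlockCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler.blocks();
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>Title</h1><p>First paragraph.</p></body></html>";
        extractBlocks(page).forEach(System.out::println);
    }
}
```

The handler accumulates state while the parser drives the callbacks, and the caller reads the result once parsing has finished. That is the same division of labour as parsing with this class and then asking it for the extracted document.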
  • Nested Class Summary

    Nested classes/interfaces inherited from class org.apache.xerces.parsers.AbstractSAXParser

    org.apache.xerces.parsers.AbstractSAXParser.AttributesProxy, org.apache.xerces.parsers.AbstractSAXParser.LocatorProxy
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
     

    Fields inherited from class org.apache.xerces.parsers.AbstractSAXParser

    ALLOW_UE_AND_NOTATION_EVENTS, DECLARATION_HANDLER, DOM_NODE, fContentHandler, fDeclaredAttrs, fDeclHandler, fDocumentHandler, fDTDHandler, fLexicalHandler, fLexicalHandlerParameterEntities, fNamespaceContext, fNamespacePrefixes, fNamespaces, fParseInProgress, fQName, fResolveDTDURIs, fStandalone, fUseEntityResolver2, fVersion, fXMLNSURIs, LEXICAL_HANDLER, NAMESPACES, STRING_INTERNING

    Fields inherited from class org.apache.xerces.parsers.AbstractXMLDocumentParser

    fDocumentSource, fDTDContentModelSource, fDTDSource, fInDTD

    Fields inherited from class org.apache.xerces.parsers.XMLParser

    ENTITY_RESOLVER, ERROR_HANDLER, fConfiguration

    Fields inherited from interface org.apache.xerces.xni.XMLDTDContentModelHandler

    OCCURS_ONE_OR_MORE, OCCURS_ZERO_OR_MORE, OCCURS_ZERO_OR_ONE, SEPARATOR_CHOICE, SEPARATOR_SEQUENCE

    Fields inherited from interface org.apache.xerces.xni.XMLDTDHandler

    CONDITIONAL_IGNORE, CONDITIONAL_INCLUDE
  • Constructor Summary

    Constructors
    Modifier
    Constructor
    Description
     
    Constructs a BoilerpipeHTMLParser using a default HTML content handler.
    protected
    BoilerpipeHTMLParser(boolean ignore)
     
     
  • Method Summary

    Modifier and Type
    Method
    Description
    void
     
    void
     
    Returns a TextDocument containing the extracted TextBlocks.

    Methods inherited from class org.apache.xerces.parsers.AbstractSAXParser

    attributeDecl, characters, comment, doctypeDecl, elementDecl, endCDATA, endDocument, endDTD, endElement, endExternalSubset, endGeneralEntity, endNamespaceMapping, endParameterEntity, externalEntityDecl, getAttributePSVI, getAttributePSVIByName, getContentHandler, getDeclHandler, getDTDHandler, getElementPSVI, getEntityResolver, getErrorHandler, getFeature, getLexicalHandler, getProperty, ignorableWhitespace, internalEntityDecl, notationDecl, parse, parse, processingInstruction, reset, setDeclHandler, setDocumentHandler, setDTDHandler, setEntityResolver, setErrorHandler, setFeature, setLexicalHandler, setLocale, setProperty, startCDATA, startDocument, startElement, startExternalSubset, startGeneralEntity, startNamespaceMapping, startParameterEntity, unparsedEntityDecl, xmlDecl

    Methods inherited from class org.apache.xerces.parsers.AbstractXMLDocumentParser

    any, element, empty, emptyElement, endAttlist, endConditional, endContentModel, endGroup, getDocumentSource, getDTDContentModelSource, getDTDSource, ignoredCharacters, occurrence, pcdata, separator, setDocumentSource, setDTDContentModelSource, setDTDSource, startAttlist, startConditional, startContentModel, startDTD, startGroup, textDecl

    Methods inherited from class org.apache.xerces.parsers.XMLParser

    parse

    Methods inherited from class Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait