Package com.kohlschutter.boilerpipe.sax
Class BoilerpipeHTMLContentHandler
java.lang.Object
com.kohlschutter.boilerpipe.sax.BoilerpipeHTMLContentHandler
- All Implemented Interfaces:
ContentHandler
A simple SAX ContentHandler, used by BoilerpipeSAXInput. Can be used by different parser implementations, e.g. NekoHTML and TagSoup.
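To make the "usable by different parser implementations" point concrete, here is a minimal sketch of a parser-agnostic SAX handler driven through the standard XMLReader interface. It is a JDK-only stand-in, not boilerpipe's implementation: the class SaxTextBlocksSketch, its BLOCK_TAGS set, and the flushing rules are assumptions made for illustration, and only the names textBuffer, inBody and flushBlock are borrowed from the member list on this page. Any parser that exposes an XMLReader could be substituted for the JDK parser used here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for illustration; NOT the boilerpipe handler.
class SaxTextBlocksSketch extends DefaultHandler {
  // Assumed set of block-level tags, chosen for this sketch only.
  private static final Set<String> BLOCK_TAGS = Set.of("p", "div", "h1", "li");

  private final StringBuilder textBuffer = new StringBuilder();
  private final List<String> blocks = new ArrayList<>();
  private int inBody = 0; // nesting depth of <body>; text outside it is skipped

  @Override
  public void startElement(String uri, String localName, String qName, Attributes atts) {
    if (BLOCK_TAGS.contains(qName)) flushBlock();
    if ("body".equals(qName)) inBody++;
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    if (BLOCK_TAGS.contains(qName)) flushBlock();
    if ("body".equals(qName)) {
      flushBlock();
      inBody--;
    }
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    if (inBody > 0) textBuffer.append(ch, start, length);
  }

  // Turns the buffered characters into one block, collapsing whitespace runs.
  private void flushBlock() {
    String text = textBuffer.toString().trim().replaceAll("\\s+", " ");
    textBuffer.setLength(0);
    if (!text.isEmpty()) blocks.add(text);
  }

  List<String> getBlocks() {
    return blocks;
  }

  // Parses well-formed XHTML with the JDK parser and returns the collected blocks.
  static List<String> extractBlocks(String xhtml) throws Exception {
    SaxTextBlocksSketch handler = new SaxTextBlocksSketch();
    XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    reader.setContentHandler(handler);
    reader.parse(new InputSource(new StringReader(xhtml)));
    return handler.getBlocks(); // read results only after parsing has finished
  }

  public static void main(String[] args) throws Exception {
    String page = "<html><head><title>T</title></head><body>"
        + "<h1>Head  line</h1><p>First\n paragraph.</p></body></html>";
    System.out.println(extractBlocks(page)); // prints [Head line, First paragraph.]
  }
}
```

The handler never refers to a concrete parser class, which is the property that lets one handler sit behind several parser implementations. The JDK parser requires well-formed input; lenient HTML parsers are what make real-world pages usable.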
Nested Class Summary
Nested Classes -
Field Summary
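Fields (Modifier and Type, Field):
(package private) static final String ANCHOR_TEXT_END
(package private) static final String ANCHOR_TEXT_START
(package private) int blockTagLevel
private BitSet currentContainedTextElements
private boolean flush
(package private) LinkedList<Integer> fontSizeStack
(package private) int inAnchor
(package private) boolean inAnchorText
(package private) int inBody
(package private) int inIgnorableElement
(package private) LinkedList<LinkedList<LabelAction>> labelStacks
private String lastEndTag
private String lastStartTag
private int offsetBlocks
private static final Pattern PAT_VALID_WORD_CHARACTER
(package private) boolean sbLastWasWhitespace
(package private) int tagLevel
(package private) StringBuilder textBuffer
private int textElementIdx
private String title
(package private) StringBuilder tokenBuffer
-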
Constructor Summary
Constructors (Constructor: Description):
BoilerpipeHTMLContentHandler(): Constructs a BoilerpipeHTMLContentHandler using the DefaultTagActionMap.
BoilerpipeHTMLContentHandler(TagActionMap tagActions): Constructs a BoilerpipeHTMLContentHandler using the given TagActionMap.
-
Method Summary
Methods (Modifier and Type, Method: Description):
void addLabelAction
protected void addTextBlock
void addWhitespaceIfNecessary()
void characters(char[] ch, int start, int length)
void endDocument()
void endElement(String uri, String localName, String qName)
void endPrefixMapping(String prefix)
void flushBlock()
getTextBlocks
getTitle()
void ignorableWhitespace(char[] ch, int start, int length)
private static boolean isWord
void processingInstruction(String target, String data)
void recycle(): Recycles this instance.
void setDocumentLocator(Locator locator)
void setTitle
void skippedEntity(String name)
void startDocument()
void startElement(String uri, String localName, String qName, Attributes atts)
void startPrefixMapping(String prefix, String uri)
TextDocument toTextDocument(): Returns a TextDocument containing the extracted TextBlocks.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.xml.sax.ContentHandler
declaration
-
Field Details
-
tagActions
-
title
-
ANCHOR_TEXT_START
-
ANCHOR_TEXT_END
-
tokenBuffer
StringBuilder tokenBuffer -
textBuffer
StringBuilder textBuffer -
inBody
int inBody -
inAnchor
int inAnchor -
inIgnorableElement
int inIgnorableElement -
tagLevel
int tagLevel -
blockTagLevel
int blockTagLevel -
sbLastWasWhitespace
boolean sbLastWasWhitespace -
textElementIdx
private int textElementIdx -
textBlocks
-
lastStartTag
-
lastEndTag
-
lastEvent
-
offsetBlocks
private int offsetBlocks -
currentContainedTextElements
-
flush
private boolean flush -
inAnchorText
boolean inAnchorText -
labelStacks
LinkedList<LinkedList<LabelAction>> labelStacks -
fontSizeStack
LinkedList<Integer> fontSizeStack -
PAT_VALID_WORD_CHARACTER
-
-
Constructor Details
-
BoilerpipeHTMLContentHandler
public BoilerpipeHTMLContentHandler()
Constructs a BoilerpipeHTMLContentHandler using the DefaultTagActionMap.
-
BoilerpipeHTMLContentHandler
Constructs a BoilerpipeHTMLContentHandler using the given TagActionMap.
Parameters:
tagActions - The TagActionMap to use, e.g. DefaultTagActionMap.
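The idea of handing the handler a table of per-tag behaviour can be sketched as follows. This is a JDK-only illustration of tag-name to action dispatch and makes no claim about the actual TagActionMap or TagAction API: the TagHook interface, the ignoring helper, and the ignoreDepth counter are hypothetical names (the counter loosely echoes the role suggested by the inIgnorableElement field listed on this page).

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of tag-name -> action dispatch; NOT the TagActionMap API.
class TagDispatchSketch extends DefaultHandler {
  /** Hypothetical callback pair invoked when a mapped tag opens or closes. */
  interface TagHook {
    void start(TagDispatchSketch handler);
    void end(TagDispatchSketch handler);
  }

  private final Map<String, TagHook> hooks;
  private final StringBuilder text = new StringBuilder();
  private int ignoreDepth = 0; // > 0 while inside an element whose text is dropped

  TagDispatchSketch(Map<String, TagHook> hooks) {
    this.hooks = hooks;
  }

  @Override
  public void startElement(String uri, String localName, String qName, Attributes atts) {
    TagHook hook = hooks.get(qName);
    if (hook != null) hook.start(this);
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    TagHook hook = hooks.get(qName);
    if (hook != null) hook.end(this);
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    if (ignoreDepth == 0) text.append(ch, start, length);
  }

  // Builds a map in which every listed tag suppresses the text it contains.
  static Map<String, TagHook> ignoring(String... tags) {
    TagHook ignore = new TagHook() {
      @Override public void start(TagDispatchSketch h) { h.ignoreDepth++; }
      @Override public void end(TagDispatchSketch h) { h.ignoreDepth--; }
    };
    Map<String, TagHook> map = new HashMap<>();
    for (String tag : tags) map.put(tag, ignore);
    return map;
  }

  // Parses well-formed XHTML and returns the text that no hook suppressed.
  static String extract(String xhtml, Map<String, TagHook> hooks) throws Exception {
    TagDispatchSketch handler = new TagDispatchSketch(hooks);
    XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    reader.setContentHandler(handler);
    reader.parse(new InputSource(new StringReader(xhtml)));
    return handler.text.toString();
  }

  public static void main(String[] args) throws Exception {
    String page = "<html><body>keep<script>drop</script> this<style>x</style></body></html>";
    System.out.println(extract(page, ignoring("script", "style"))); // prints keep this
  }
}
```

Keeping the per-tag behaviour in a map passed to the constructor means the traversal code stays fixed while callers swap in a different table, which is the kind of flexibility the two constructors above appear to offer.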
-
-
Method Details
-
recycle
public void recycle()
Recycles this instance.
-
endDocument
Specified by:
endDocument in interface ContentHandler
Throws:
SAXException
-
endPrefixMapping
Specified by:
endPrefixMapping in interface ContentHandler
Throws:
SAXException
-
ignorableWhitespace
Specified by:
ignorableWhitespace in interface ContentHandler
Throws:
SAXException
-
processingInstruction
Specified by:
processingInstruction in interface ContentHandler
Throws:
SAXException
-
setDocumentLocator
Specified by:
setDocumentLocator in interface ContentHandler
-
skippedEntity
Specified by:
skippedEntity in interface ContentHandler
Throws:
SAXException
-
startDocument
Specified by:
startDocument in interface ContentHandler
Throws:
SAXException
-
startPrefixMapping
Specified by:
startPrefixMapping in interface ContentHandler
Throws:
SAXException
-
startElement
public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException
Specified by:
startElement in interface ContentHandler
Throws:
SAXException
-
endElement
Specified by:
endElement in interface ContentHandler
Throws:
SAXException
-
characters
Specified by:
characters in interface ContentHandler
Throws:
SAXException
-
getTextBlocks
-
flushBlock
public void flushBlock() -
addTextBlock
-
isWord
-
getTitle
-
setTitle
-
toTextDocument
Returns a TextDocument containing the extracted TextBlocks. NOTE: Only call this after parsing.
Returns:
The TextDocument
-
addWhitespaceIfNecessary
public void addWhitespaceIfNecessary() -
addLabelAction
Throws:
IllegalStateException
-