Package com.kohlschutter.boilerpipe.sax
Class BoilerpipeHTMLContentHandler
- java.lang.Object
-
- com.kohlschutter.boilerpipe.sax.BoilerpipeHTMLContentHandler
-
- All Implemented Interfaces:
org.xml.sax.ContentHandler
public class BoilerpipeHTMLContentHandler extends java.lang.Object implements org.xml.sax.ContentHandler
A simple SAX ContentHandler, used by BoilerpipeSAXInput. Can be used by different parser implementations, e.g. NekoHTML and TagSoup.
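To illustrate the SAX callback pattern this class builds on, here is a minimal sketch that uses only the JDK parser. `BlockCollectingHandlerSketch` and its choice of block-level tags are assumptions made for illustration. It is not the boilerpipe implementation, which tracks considerably more state (anchors, labels, tag levels).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in, NOT boilerpipe code: collects one text block
// per <p> or <h1> element using plain SAX callbacks.
class BlockCollectingHandlerSketch extends DefaultHandler {
    private final List<String> blocks = new ArrayList<>();
    private final StringBuilder textBuffer = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Text seen before a block-level tag belongs to the previous block.
        if (isBlockTag(qName)) {
            flushBlock();
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isBlockTag(qName)) {
            flushBlock();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver text in several chunks, so buffer it.
        textBuffer.append(ch, start, length);
    }

    @Override
    public void endDocument() {
        flushBlock();
    }

    // Assumption: only <p> and <h1> start a new block in this sketch.
    private static boolean isBlockTag(String qName) {
        return "p".equalsIgnoreCase(qName) || "h1".equalsIgnoreCase(qName);
    }

    // Emits the buffered text as one block, collapsing runs of whitespace.
    private void flushBlock() {
        String text = textBuffer.toString().trim().replaceAll("\\s+", " ");
        textBuffer.setLength(0);
        if (!text.isEmpty()) {
            blocks.add(text);
        }
    }

    List<String> getBlocks() {
        return blocks;
    }

    // Parses well-formed XHTML with the JDK's default SAX parser.
    static List<String> parse(String xhtml) {
        BlockCollectingHandlerSketch handler = new BlockCollectingHandlerSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler.getBlocks();
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><h1>Title</h1>"
            + "<p>First <a href=\"#\">link</a> text.</p><p>Second.</p></body></html>";
        System.out.println(parse(xhtml));
    }
}
```

The JDK parser only accepts well-formed markup. For real-world HTML, a tolerant SAX parser such as NekoHTML or TagSoup would drive the handler instead.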
-
-
Nested Class Summary
Nested Classes
Modifier and Type, Class, Description:
private static class BoilerpipeHTMLContentHandler.Event
-
Field Summary
Fields
Modifier and Type, Field, Description:
(package private) static java.lang.String ANCHOR_TEXT_END
(package private) static java.lang.String ANCHOR_TEXT_START
(package private) int blockTagLevel
private java.util.BitSet currentContainedTextElements
private boolean flush
(package private) java.util.LinkedList<java.lang.Integer> fontSizeStack
(package private) int inAnchor
(package private) boolean inAnchorText
(package private) int inBody
(package private) int inIgnorableElement
(package private) java.util.LinkedList<java.util.LinkedList<LabelAction>> labelStacks
private java.lang.String lastEndTag
private BoilerpipeHTMLContentHandler.Event lastEvent
private java.lang.String lastStartTag
private int offsetBlocks
private static java.util.regex.Pattern PAT_VALID_WORD_CHARACTER
(package private) boolean sbLastWasWhitespace
private java.util.Map<java.lang.String,TagAction> tagActions
(package private) int tagLevel
private java.util.List<TextBlock> textBlocks
(package private) java.lang.StringBuilder textBuffer
private int textElementIdx
private java.lang.String title
(package private) java.lang.StringBuilder tokenBuffer
-
Constructor Summary
Constructors
Constructor, Description:
BoilerpipeHTMLContentHandler()
Constructs a BoilerpipeHTMLContentHandler using the DefaultTagActionMap.
BoilerpipeHTMLContentHandler(TagActionMap tagActions)
Constructs a BoilerpipeHTMLContentHandler using the given TagActionMap.
-
Method Summary
All Methods
Modifier and Type, Method, Description:
void addLabelAction(LabelAction la)
protected void addTextBlock(TextBlock tb)
void addWhitespaceIfNecessary()
void characters(char[] ch, int start, int length)
void endDocument()
void endElement(java.lang.String uri, java.lang.String localName, java.lang.String qName)
void endPrefixMapping(java.lang.String prefix)
void flushBlock()
(package private) java.util.List<TextBlock> getTextBlocks()
java.lang.String getTitle()
void ignorableWhitespace(char[] ch, int start, int length)
private static boolean isWord(java.lang.String token)
void processingInstruction(java.lang.String target, java.lang.String data)
void recycle(): Recycles this instance.
void setDocumentLocator(org.xml.sax.Locator locator)
void setTitle(java.lang.String s)
void skippedEntity(java.lang.String name)
void startDocument()
void startElement(java.lang.String uri, java.lang.String localName, java.lang.String qName, org.xml.sax.Attributes atts)
void startPrefixMapping(java.lang.String prefix, java.lang.String uri)
TextDocument toTextDocument(): Returns a TextDocument containing the extracted TextBlocks.
-
-
-
Field Detail
-
tagActions
private final java.util.Map<java.lang.String,TagAction> tagActions
-
title
private java.lang.String title
-
ANCHOR_TEXT_START
static final java.lang.String ANCHOR_TEXT_START
- See Also:
- Constant Field Values
-
ANCHOR_TEXT_END
static final java.lang.String ANCHOR_TEXT_END
- See Also:
- Constant Field Values
-
tokenBuffer
java.lang.StringBuilder tokenBuffer
-
textBuffer
java.lang.StringBuilder textBuffer
-
inBody
int inBody
-
inAnchor
int inAnchor
-
inIgnorableElement
int inIgnorableElement
-
tagLevel
int tagLevel
-
blockTagLevel
int blockTagLevel
-
sbLastWasWhitespace
boolean sbLastWasWhitespace
-
textElementIdx
private int textElementIdx
-
textBlocks
private final java.util.List<TextBlock> textBlocks
-
lastStartTag
private java.lang.String lastStartTag
-
lastEndTag
private java.lang.String lastEndTag
-
lastEvent
private BoilerpipeHTMLContentHandler.Event lastEvent
-
offsetBlocks
private int offsetBlocks
-
currentContainedTextElements
private java.util.BitSet currentContainedTextElements
-
flush
private boolean flush
-
inAnchorText
boolean inAnchorText
-
labelStacks
java.util.LinkedList<java.util.LinkedList<LabelAction>> labelStacks
-
fontSizeStack
java.util.LinkedList<java.lang.Integer> fontSizeStack
-
PAT_VALID_WORD_CHARACTER
private static final java.util.regex.Pattern PAT_VALID_WORD_CHARACTER
-
-
Constructor Detail
-
BoilerpipeHTMLContentHandler
public BoilerpipeHTMLContentHandler()
Constructs a BoilerpipeHTMLContentHandler using the DefaultTagActionMap.
-
BoilerpipeHTMLContentHandler
public BoilerpipeHTMLContentHandler(TagActionMap tagActions)
Constructs a BoilerpipeHTMLContentHandler using the given TagActionMap.
- Parameters:
tagActions - The TagActionMap to use, e.g. DefaultTagActionMap.
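The field summary shows that tag actions are held in a map keyed by tag name. The following is a minimal sketch of that kind of map-based dispatch. Every name in it (`TagDispatchSketch`, `Action`, `defaultActions`) is hypothetical, and it does not reproduce the real TagAction or TagActionMap interfaces, which this page does not describe.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical names throughout; this is not boilerpipe's TagAction or TagActionMap.
class TagDispatchSketch {
    /** Hypothetical per-tag callback, invoked when a start tag is seen. */
    interface Action {
        void onStart(String tag, List<String> out);
    }

    private final Map<String, Action> actions;
    private final List<String> log = new ArrayList<>();

    // The map is injected, so callers can swap in their own tag handling.
    TagDispatchSketch(Map<String, Action> actions) {
        this.actions = actions;
    }

    /** An assumed default map, loosely analogous to a default tag-action map. */
    static Map<String, Action> defaultActions() {
        Map<String, Action> m = new HashMap<>();
        m.put("SCRIPT", (tag, out) -> out.add("ignore:" + tag));
        m.put("A", (tag, out) -> out.add("anchor:" + tag));
        return m;
    }

    // Looks up the action by upper-cased tag name; unknown tags are logged as plain.
    void startElement(String tag) {
        Action action = actions.get(tag.toUpperCase(Locale.ROOT));
        if (action != null) {
            action.onStart(tag, log);
        } else {
            log.add("plain:" + tag);
        }
    }

    List<String> getLog() {
        return log;
    }

    public static void main(String[] args) {
        TagDispatchSketch dispatcher = new TagDispatchSketch(defaultActions());
        dispatcher.startElement("a");
        dispatcher.startElement("script");
        dispatcher.startElement("p");
        System.out.println(dispatcher.getLog());
    }
}
```

Passing the map through the constructor keeps the handler's parsing loop independent of which tags receive special treatment.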
-
-
Method Detail
-
recycle
public void recycle()
Recycles this instance.
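This page does not say which state recycle() resets. As a rough illustration of the general idea, the sketch below assumes that recycling clears everything accumulated during a parse so that one handler can serve several documents. `RecyclableHandlerSketch` is a hypothetical JDK-only class, not boilerpipe code.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that can be reset and reused; NOT boilerpipe code.
class RecyclableHandlerSketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int elementCount;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        elementCount++;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // Assumed meaning of "recycle": drop all per-document state.
    void recycle() {
        text.setLength(0);
        elementCount = 0;
    }

    String getText() {
        return text.toString();
    }

    int getElementCount() {
        return elementCount;
    }

    // Feeds one well-formed XML document through the given handler.
    static void parseInto(RecyclableHandlerSketch handler, String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        RecyclableHandlerSketch handler = new RecyclableHandlerSketch();
        parseInto(handler, "<doc><p>first</p></doc>");
        System.out.println(handler.getText() + " / " + handler.getElementCount());

        handler.recycle(); // without this, the next parse would append to old state
        parseInto(handler, "<doc><p>second</p><p>page</p></doc>");
        System.out.println(handler.getText() + " / " + handler.getElementCount());
    }
}
```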
-
endDocument
public void endDocument() throws org.xml.sax.SAXException
- Specified by:
endDocument in interface org.xml.sax.ContentHandler
- Throws:
org.xml.sax.SAXException
-
endPrefixMapping
public void endPrefixMapping(java.lang.String prefix) throws org.xml.sax.SAXException
- Specified by:
endPrefixMapping in interface org.xml.sax.ContentHandler
- Throws:
org.xml.sax.SAXException
-
ignorableWhitespace
public void ignorableWhitespace(char[] ch, int start, int length) throws org.xml.sax.SAXException
- Specified by:
ignorableWhitespace in interface org.xml.sax.ContentHandler
- Throws:
org.xml.sax.SAXException
-
processingInstruction
public void processingInstruction(java.lang.String target, java.lang.String data) throws org.xml.sax.SAXException
- Specified by:
processingInstruction in interface org.xml.sax.ContentHandler
- Throws:
org.xml.sax.SAXException
-
setDocumentLocator
public void setDocumentLocator(org.xml.sax.Locator locator)
- Specified by:
setDocumentLocator in interface org.xml.sax.ContentHandler
-
skippedEntity
public void skippedEntity(java.lang.String name) throws org.xml.sax.SAXException
- Specified by:
skippedEntity in interface org.xml.sax.ContentHandler
- Throws:
org.xml.sax.SAXException
-
startDocument
public void startDocument() throws org.xml.sax.SAXException
- Specified by:
startDocument in interface org.xml.sax.ContentHandler
- Throws:
org.xml.sax.SAXException
-
startPrefixMapping
public void startPrefixMapping(java.lang.String prefix, java.lang.String uri) throws org.xml.sax.SAXException
- Specified by:
startPrefixMapping in interface org.xml.sax.ContentHandler
- Throws:
org.xml.sax.SAXException
-
startElement
public void startElement(java.lang.String uri, java.lang.String localName, java.lang.String qName, org.xml.sax.Attributes atts) throws org.xml.sax.SAXException
- Specified by:
startElement in interface org.xml.sax.ContentHandler
- Throws:
org.xml.sax.SAXException
-
endElement
public void endElement(java.lang.String uri, java.lang.String localName, java.lang.String qName) throws org.xml.sax.SAXException
- Specified by:
endElement in interface org.xml.sax.ContentHandler
- Throws:
org.xml.sax.SAXException
-
characters
public void characters(char[] ch, int start, int length) throws org.xml.sax.SAXException
- Specified by:
characters in interface org.xml.sax.ContentHandler
- Throws:
org.xml.sax.SAXException
-
getTextBlocks
java.util.List<TextBlock> getTextBlocks()
-
flushBlock
public void flushBlock()
-
addTextBlock
protected void addTextBlock(TextBlock tb)
-
isWord
private static boolean isWord(java.lang.String token)
-
getTitle
public java.lang.String getTitle()
-
setTitle
public void setTitle(java.lang.String s)
-
toTextDocument
public TextDocument toTextDocument()
Returns a TextDocument containing the extracted TextBlocks. NOTE: Only call this after parsing.
- Returns:
The TextDocument
-
addWhitespaceIfNecessary
public void addWhitespaceIfNecessary()
-
addLabelAction
public void addLabelAction(LabelAction la) throws java.lang.IllegalStateException
- Throws:
java.lang.IllegalStateException
-
-