Package org.apache.pdfbox.tools
Class PDFText2HTML
- java.lang.Object
  - org.apache.pdfbox.contentstream.PDFStreamEngine
    - org.apache.pdfbox.text.PDFTextStripper
      - org.apache.pdfbox.tools.PDFText2HTML
public class PDFText2HTML extends org.apache.pdfbox.text.PDFTextStripper
Wrap stripped text in simple HTML, trying to form HTML paragraphs. Paragraphs broken by pages, columns, or figures are not mended.
- Author: John J Barton
Constructor Summary
- PDFText2HTML()
  Constructor.
-
Method Summary
- protected float computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0)
- protected void endArticle()
  Write out the article separator.
- void endDocument(org.apache.pdfbox.pdmodel.PDDocument document)
- protected java.lang.String getTitle()
  This method will attempt to guess the title of the document using either the document properties or the first lines of text.
- protected void showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, java.lang.String arg3, org.apache.pdfbox.util.Vector arg4)
- protected void startArticle(boolean isLTR)
  Write out the article separator (div tag) with proper text direction information.
- protected void startDocument(org.apache.pdfbox.pdmodel.PDDocument document)
- protected void writeHeader()
  Deprecated.
- protected void writeParagraphEnd()
  Writes the paragraph end "</p>" to the output.
- protected void writeString(java.lang.String chars)
  Write a string to the output stream and escape some HTML characters.
- protected void writeString(java.lang.String text, java.util.List<org.apache.pdfbox.text.TextPosition> textPositions)
  Write a string to the output stream, maintain font state, and escape some HTML characters.
Methods inherited from class org.apache.pdfbox.text.PDFTextStripper
beginMarkedContentSequence, endMarkedContentSequence, endPage, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getCurrentPageNo, getDropThreshold, getEndBookmark, getEndPage, getIgnoreContentStreamSpaceGlyphs, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getStartPage, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processPage, processPages, processTextPosition, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndBookmark, setEndPage, setIgnoreContentStreamSpaceGlyphs, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setStartBookmark, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startPage, writeCharacters, writeLineSeparator, writePage, writePageEnd, writePageStart, writeParagraphSeparator, writeParagraphStart, writeText, writeWordSeparator
-
Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine
addOperator, applyTextAdjustment, beginText, decreaseLevel, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, isShouldProcessColorOperators, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, registerOperatorProcessor, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showFontGlyph, showForm, showGlyph, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
-
Method Detail
-
writeHeader
@Deprecated protected void writeHeader() throws java.io.IOException
Deprecated.
Write the header to the output document. Now also writes the tag defining the character encoding.
- Throws: java.io.IOException - If there is a problem writing out the header to the document.
-
startDocument
protected void startDocument(org.apache.pdfbox.pdmodel.PDDocument document) throws java.io.IOException
- Overrides: startDocument in class org.apache.pdfbox.text.PDFTextStripper
- Throws: java.io.IOException
-
endDocument
public void endDocument(org.apache.pdfbox.pdmodel.PDDocument document) throws java.io.IOException
- Overrides: endDocument in class org.apache.pdfbox.text.PDFTextStripper
- Throws: java.io.IOException
-
getTitle
protected java.lang.String getTitle()
Attempts to guess the title of the document using either the document properties or the first lines of text.
- Returns: the title.
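
The fallback described above (document properties first, then the first lines of text) can be sketched roughly as follows. This is a minimal illustration only: the class `TitleGuessSketch`, the method `guessTitle`, and its parameters are hypothetical names, and the actual heuristic used by PDFText2HTML is not specified on this page.

```java
import java.util.Arrays;
import java.util.List;

public class TitleGuessSketch {

    /**
     * Hypothetical helper: prefers a non-blank metadata title, otherwise
     * falls back to the first non-blank line of extracted text.
     * Returns an empty string if neither source yields a title.
     */
    public static String guessTitle(String metadataTitle, List<String> firstLines) {
        // First choice: the title from the document properties.
        if (metadataTitle != null && !metadataTitle.trim().isEmpty()) {
            return metadataTitle.trim();
        }
        // Fallback: scan the leading lines of text for something usable.
        for (String line : firstLines) {
            if (line != null && !line.trim().isEmpty()) {
                return line.trim();
            }
        }
        return "";
    }

    public static void main(String[] args) {
        System.out.println(guessTitle("Annual Report", Arrays.asList("Page 1")));
        System.out.println(guessTitle(null, Arrays.asList("  ", "Introduction")));
    }
}
```

The real method takes no arguments and reads its inputs from the stripper's own state; the explicit parameters here only make the fallback order visible.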
-
startArticle
protected void startArticle(boolean isLTR) throws java.io.IOException
Write out the article separator (div tag) with proper text direction information.
- Overrides: startArticle in class org.apache.pdfbox.text.PDFTextStripper
- Parameters: isLTR - true if direction of text is left to right
- Throws: java.io.IOException - If there is an error writing to the stream.
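
To illustrate what "a div tag with text direction information" might look like, here is a small standalone sketch. The class `ArticleDivSketch` and its methods are hypothetical, and the exact markup emitted by PDFText2HTML (for example the casing of the `dir` value) is an assumption, not something this page states.

```java
public class ArticleDivSketch {

    /**
     * Hypothetical helper: builds the opening tag for an article.
     * Left-to-right text needs no attribute (it is the HTML default);
     * right-to-left text gets an explicit dir attribute.
     */
    public static String openArticle(boolean isLTR) {
        return isLTR ? "<div>" : "<div dir=\"rtl\">";
    }

    /** Hypothetical helper: builds the matching closing tag. */
    public static String closeArticle() {
        return "</div>";
    }

    public static void main(String[] args) {
        System.out.println(openArticle(true) + "Hello" + closeArticle());
        System.out.println(openArticle(false) + "\u05E9\u05DC\u05D5\u05DD" + closeArticle());
    }
}
```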
-
endArticle
protected void endArticle() throws java.io.IOException
Write out the article separator.
- Overrides: endArticle in class org.apache.pdfbox.text.PDFTextStripper
- Throws: java.io.IOException - If there is an error writing to the stream.
-
writeString
protected void writeString(java.lang.String text, java.util.List<org.apache.pdfbox.text.TextPosition> textPositions) throws java.io.IOException
Write a string to the output stream, maintain font state, and escape some HTML characters. The font state is only preserved per word.
- Overrides: writeString in class org.apache.pdfbox.text.PDFTextStripper
- Parameters:
  text - The text to write to the stream.
  textPositions - the corresponding text positions
- Throws: java.io.IOException - If there is an error writing to the stream.
-
writeString
protected void writeString(java.lang.String chars) throws java.io.IOException
Write a string to the output stream and escape some HTML characters.
- Overrides: writeString in class org.apache.pdfbox.text.PDFTextStripper
- Parameters: chars - String to be written to the stream
- Throws: java.io.IOException - If there is an error writing to the stream.
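
Both writeString overloads escape "some HTML characters" before writing. Which characters are covered is not listed on this page, so the following is only a sketch of the general technique, assuming the four characters most commonly escaped in HTML text. The class `HtmlEscapeSketch` and its `escape` method are hypothetical and are not part of PDFBox.

```java
public class HtmlEscapeSketch {

    /**
     * Hypothetical helper: replaces characters that have special meaning
     * in HTML with their entity references. The set of characters handled
     * here is an assumption, not the set PDFText2HTML actually escapes.
     */
    public static String escape(String chars) {
        StringBuilder sb = new StringBuilder(chars.length());
        for (int i = 0; i < chars.length(); i++) {
            char c = chars.charAt(i);
            switch (c) {
                case '&':
                    sb.append("&amp;");
                    break;
                case '<':
                    sb.append("&lt;");
                    break;
                case '>':
                    sb.append("&gt;");
                    break;
                case '"':
                    sb.append("&quot;");
                    break;
                default:
                    // All other characters pass through unchanged.
                    sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: &lt;p&gt;Tom &amp; Jerry&lt;/p&gt;
        System.out.println(escape("<p>Tom & Jerry</p>"));
    }
}
```

Escaping matters here because text stripped from a PDF can contain `<` or `&`, which would otherwise be interpreted as markup once wrapped in paragraph tags.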
-
writeParagraphEnd
protected void writeParagraphEnd() throws java.io.IOException
Writes the paragraph end "</p>" to the output. Furthermore, it will also clear the font state.
- Overrides: writeParagraphEnd in class org.apache.pdfbox.text.PDFTextStripper
- Throws: java.io.IOException
-
showGlyph
protected void showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, java.lang.String arg3, org.apache.pdfbox.util.Vector arg4) throws java.io.IOException
- Overrides: showGlyph in class org.apache.pdfbox.contentstream.PDFStreamEngine
- Throws: java.io.IOException
-
computeFontHeight
protected float computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0) throws java.io.IOException
- Throws: java.io.IOException