Package org.apache.pdfbox.tools
Class PDFText2Markdown
java.lang.Object
org.apache.pdfbox.contentstream.PDFStreamEngine
org.apache.pdfbox.text.LegacyPDFStreamEngine
org.apache.pdfbox.text.PDFTextStripper
org.apache.pdfbox.tools.PDFText2Markdown
Convert PDF text to Markdown format. Each line in the PDF is converted to a corresponding
Markdown paragraph. Bold and italic formatting is also applied based on font properties.
-
Nested Class Summary
Nested Classes
private static class
    A helper class to maintain the current font state.
-
Field Summary
Fields
fontState
Fields inherited from class org.apache.pdfbox.text.PDFTextStripper
charactersByArticle, document, LINE_SEPARATOR, output
-
Constructor Summary
Constructors
PDFText2Markdown()
    Constructor.
-
Method Summary
private static void appendEscaped(StringBuilder builder, char character)
protected void endArticle()
    Write out the article separator.
private static String escape(String chars)
    Escape some Markdown characters.
protected void startArticle(boolean isLTR)
    Write out the article separator with proper text direction information.
protected void writeParagraphEnd()
    Writes the Markdown paragraph end to the output.
protected void writeString(String chars)
    Write a string to the output stream and escape some Markdown characters.
protected void writeString(String text, List<TextPosition> textPositions)
    Write a string to the output stream, maintain font state, and escape some Markdown characters.
Methods inherited from class org.apache.pdfbox.text.PDFTextStripper
endDocument, endPage, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getCurrentPageNo, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getStartPage, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processPage, processPages, processTextPosition, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndBookmark, setEndPage, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setStartBookmark, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startDocument, startPage, writeCharacters, writeLineSeparator, writePage, writePageEnd, writePageStart, writeParagraphSeparator, writeParagraphStart, writeText, writeWordSeparator
Methods inherited from class org.apache.pdfbox.text.LegacyPDFStreamEngine
computeFontHeight, showGlyph
Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine
addOperator, applyTextAdjustment, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showForm, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
-
Field Details
-
fontState
-
-
Constructor Details
-
PDFText2Markdown
public PDFText2Markdown()
Constructor.
-
-
Method Details
-
escape
Escape some Markdown characters.
Parameters:
chars - String to be escaped
Returns:
the escaped String
-
appendEscaped
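The pairing of escape and appendEscaped can be illustrated with a minimal, standard-library-only sketch. The page does not say which characters are escaped, so the SPECIAL set below is an assumption, and the class name MarkdownEscapeSketch is hypothetical. It is not the PDFBox implementation.

```java
final class MarkdownEscapeSketch {
    // ASSUMPTION: the set of escaped characters is illustrative only;
    // the real class may escape a different set.
    private static final String SPECIAL = "\\`*_{}[]<>#|";

    // Escapes every character of the input by delegating to appendEscaped.
    static String escape(String chars) {
        StringBuilder builder = new StringBuilder(chars.length());
        for (int i = 0; i < chars.length(); i++) {
            appendEscaped(builder, chars.charAt(i));
        }
        return builder.toString();
    }

    // Appends one character, prefixing it with a backslash if it is special.
    static void appendEscaped(StringBuilder builder, char character) {
        if (SPECIAL.indexOf(character) >= 0) {
            builder.append('\\');
        }
        builder.append(character);
    }

    public static void main(String[] args) {
        // prints: 2 \* 3 = 6 and snake\_case
        System.out.println(escape("2 * 3 = 6 and snake_case"));
    }
}
```

Backslash-escaping leaves ordinary text untouched and stops characters such as * and _ from being read as emphasis markers.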
-
startArticle
Write out the article separator with proper text direction information.
Overrides:
startArticle in class PDFTextStripper
Parameters:
isLTR - true if direction of text is left to right
Throws:
IOException - If there is an error writing to the stream.
-
endArticle
Write out the article separator.
Overrides:
endArticle in class PDFTextStripper
Throws:
IOException - If there is an error writing to the stream.
-
writeString
Write a string to the output stream, maintain font state, and escape some Markdown characters. The font state is only preserved per word.
Overrides:
writeString in class PDFTextStripper
Parameters:
text - The text to write to the stream.
textPositions - The corresponding text positions.
Throws:
IOException - If there is an error writing to the stream.
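The idea of maintaining a per-word font state can be sketched without PDFBox. Everything below is an assumption made for illustration: the class FontStateSketch, the Glyph type (a stand-in for a text position with already-resolved bold and italic flags), and the choice of `**` and `*` as markers. How the real class derives bold and italic from font properties is not covered by this page.

```java
import java.util.ArrayList;
import java.util.List;

final class FontStateSketch {
    // Hypothetical stand-in for a text position with resolved style flags.
    static final class Glyph {
        final char character;
        final boolean bold;
        final boolean italic;

        Glyph(char character, boolean bold, boolean italic) {
            this.character = character;
            this.bold = bold;
            this.italic = italic;
        }
    }

    private boolean bold;
    private boolean italic;

    // ASSUMPTION: marker choice is illustrative.
    private static String marker(boolean bold, boolean italic) {
        if (bold && italic) return "***";
        if (bold) return "**";
        if (italic) return "*";
        return "";
    }

    // Writes one word, opening and closing markers when the style changes,
    // and clears the state at the end so it does not leak into the next word.
    String writeWord(List<Glyph> glyphs) {
        StringBuilder out = new StringBuilder();
        for (Glyph g : glyphs) {
            if (g.bold != bold || g.italic != italic) {
                out.append(marker(bold, italic)); // close the previous style
                bold = g.bold;
                italic = g.italic;
                out.append(marker(bold, italic)); // open the new style
            }
            out.append(g.character);
        }
        out.append(clear());
        return out.toString();
    }

    // Closes any open markers and resets the state.
    String clear() {
        String closing = marker(bold, italic);
        bold = false;
        italic = false;
        return closing;
    }

    // Helper that builds glyphs sharing one style.
    static List<Glyph> styled(String text, boolean bold, boolean italic) {
        List<Glyph> glyphs = new ArrayList<>();
        for (char c : text.toCharArray()) {
            glyphs.add(new Glyph(c, bold, italic));
        }
        return glyphs;
    }

    public static void main(String[] args) {
        FontStateSketch sketch = new FontStateSketch();
        // prints: **Bold** plain
        System.out.println(sketch.writeWord(styled("Bold", true, false))
                + " " + sketch.writeWord(styled("plain", false, false)));
    }
}
```

Closing the markers at the end of each word keeps the emitted Markdown balanced even if the next word uses a different font.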
-
writeString
Write a string to the output stream and escape some Markdown characters.
Overrides:
writeString in class PDFTextStripper
Parameters:
chars - String to be written to the stream.
Throws:
IOException - If there is an error writing to the stream.
-
writeParagraphEnd
Writes the Markdown paragraph end to the output and clears the font state.
Write something (if defined) at the end of a paragraph.
Overrides:
writeParagraphEnd in class PDFTextStripper
Throws:
IOException - if something went wrong
-