Class PDFText2Markdown


  • public class PDFText2Markdown
    extends org.apache.pdfbox.text.PDFTextStripper
    Converts PDF text to Markdown. Each line in the PDF becomes a corresponding Markdown paragraph, and bold and italic formatting is applied based on font properties.
    Author:
    Saurav Rawat
    • Field Summary

      • Fields inherited from class org.apache.pdfbox.text.PDFTextStripper

        charactersByArticle, document, LINE_SEPARATOR, output
    • Constructor Summary

      Constructors 
      Constructor Description
      PDFText2Markdown()
      Constructor.
    • Method Summary

      Modifier and Type Method Description
      protected float computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0)
      protected void endArticle()
      Write out the article separator.
      protected void showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, java.lang.String arg3, org.apache.pdfbox.util.Vector arg4)
      protected void startArticle(boolean isLTR)
      Write out the article separator with proper text direction information.
      protected void writeParagraphEnd()
      Writes the Markdown paragraph end to the output.
      protected void writeString(java.lang.String chars)
      Write a string to the output stream and escape some Markdown characters.
      protected void writeString(java.lang.String text, java.util.List<org.apache.pdfbox.text.TextPosition> textPositions)
      Write a string to the output stream, maintain font state, and escape some Markdown characters.
      • Methods inherited from class org.apache.pdfbox.text.PDFTextStripper

        beginMarkedContentSequence, endDocument, endMarkedContentSequence, endPage, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getCurrentPageNo, getDropThreshold, getEndBookmark, getEndPage, getIgnoreContentStreamSpaceGlyphs, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getStartPage, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processPage, processPages, processTextPosition, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndBookmark, setEndPage, setIgnoreContentStreamSpaceGlyphs, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setStartBookmark, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startDocument, startPage, writeCharacters, writeLineSeparator, writePage, writePageEnd, writePageStart, writeParagraphSeparator, writeParagraphStart, writeText, writeWordSeparator
      • Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine

        addOperator, applyTextAdjustment, beginText, decreaseLevel, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, isShouldProcessColorOperators, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, registerOperatorProcessor, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showFontGlyph, showForm, showGlyph, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • PDFText2Markdown

        public PDFText2Markdown()
                         throws java.io.IOException
        Constructor.
        Throws:
        java.io.IOException - If there is an error during initialization.
    • Method Detail

      • startArticle

        protected void startArticle(boolean isLTR)
                             throws java.io.IOException
        Write out the article separator with proper text direction information.
        Overrides:
        startArticle in class org.apache.pdfbox.text.PDFTextStripper
        Parameters:
        isLTR - true if direction of text is left to right
        Throws:
        java.io.IOException - If there is an error writing to the stream.
      • endArticle

        protected void endArticle()
                           throws java.io.IOException
        Write out the article separator.
        Overrides:
        endArticle in class org.apache.pdfbox.text.PDFTextStripper
        Throws:
        java.io.IOException - If there is an error writing to the stream.
      • writeString

        protected void writeString(java.lang.String text,
                                   java.util.List<org.apache.pdfbox.text.TextPosition> textPositions)
                            throws java.io.IOException
        Write a string to the output stream, maintain font state, and escape some Markdown characters. The font state is only preserved per word.
        Overrides:
        writeString in class org.apache.pdfbox.text.PDFTextStripper
        Parameters:
        text - The text to write to the stream.
        textPositions - The corresponding text positions.
        Throws:
        java.io.IOException - If there is an error writing to the stream.
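
The sketch below illustrates one plausible reading of "font state is only preserved per word": a single style is chosen for each word and the word is wrapped in matching Markdown markers. It is not taken from the class itself. The class name FontStyleSketch, the helper names, and the idea of detecting style from keywords in the font name are assumptions for illustration, and plain strings stand in for the PDFBox TextPosition and PDFont types so the example runs on its own.

```java
import java.util.Locale;

final class FontStyleSketch {

    // Assumption: a font whose name contains "bold" is treated as bold.
    static boolean isBold(String fontName) {
        return fontName != null
                && fontName.toLowerCase(Locale.ROOT).contains("bold");
    }

    // Assumption: "italic" or "oblique" in the font name means italic.
    static boolean isItalic(String fontName) {
        if (fontName == null) {
            return false;
        }
        String lower = fontName.toLowerCase(Locale.ROOT);
        return lower.contains("italic") || lower.contains("oblique");
    }

    // Wraps one word in Markdown emphasis markers chosen from its font name.
    // The marker string is symmetric ("*", "**" or "***"), so the same
    // string can be used to open and to close.
    static String wrap(String word, String fontName) {
        String marker = (isBold(fontName) ? "**" : "")
                + (isItalic(fontName) ? "*" : "");
        return marker + word + marker;
    }

    public static void main(String[] args) {
        System.out.println(wrap("warning", "ABCDEF+Helvetica-Bold"));
        System.out.println(wrap("emphasis", "Times-Italic"));
        System.out.println(wrap("plain", "Helvetica"));
    }
}
```

A real implementation would more likely consult the font descriptor than the font name alone, since embedded fonts do not always carry a descriptive name.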
      • writeString

        protected void writeString(java.lang.String chars)
                            throws java.io.IOException
        Write a string to the output stream and escape some Markdown characters.
        Overrides:
        writeString in class org.apache.pdfbox.text.PDFTextStripper
        Parameters:
        chars - String to be written to the stream.
        Throws:
        java.io.IOException - If there is an error writing to the stream.
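
As a rough illustration of what escaping "some Markdown characters" can look like, the sketch below prefixes each character from a fixed set with a backslash. The class name MarkdownEscapeSketch and the particular character set are assumptions; this page does not say which characters the class escapes.

```java
final class MarkdownEscapeSketch {

    // Assumed set of characters to escape: backslash, backtick, asterisk,
    // underscore, square brackets and hash. The real set is not documented here.
    private static final String SPECIAL = "\\`*_[]#";

    // Returns a copy of the input with every special character
    // preceded by a backslash so Markdown renders it literally.
    static String escape(String chars) {
        StringBuilder sb = new StringBuilder(chars.length() + 8);
        for (int i = 0; i < chars.length(); i++) {
            char c = chars.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Without escaping, "*" and "_" could be read as emphasis markers.
        System.out.println(escape("2 * 3 = 6 and snake_case"));
    }
}
```

Escaping at write time keeps literal characters from the PDF from being mistaken for the emphasis markers that the converter itself emits.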
      • writeParagraphEnd

        protected void writeParagraphEnd()
                                  throws java.io.IOException
        Writes the Markdown paragraph end to the output and clears the font state.
        Overrides:
        writeParagraphEnd in class org.apache.pdfbox.text.PDFTextStripper
        Throws:
        java.io.IOException - If there is an error writing to the stream.
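
To show why clearing the font state at the end of a paragraph matters, the sketch below tracks which emphasis markers are open and closes them before the paragraph break, so that formatting cannot leak into the next paragraph. The class ParagraphEndSketch, its method names, and the blank-line paragraph end are assumptions for illustration, not the class's actual internals.

```java
final class ParagraphEndSketch {
    private boolean bold;
    private boolean italic;

    // Returns the markers needed to move from the current style to the
    // requested one: close what is open, then open the new style.
    String push(boolean wantBold, boolean wantItalic) {
        if (wantBold == bold && wantItalic == italic) {
            return "";
        }
        String markers = clear() + (wantBold ? "**" : "") + (wantItalic ? "*" : "");
        bold = wantBold;
        italic = wantItalic;
        return markers;
    }

    // Closes whatever is open (innermost marker first) and resets the state.
    String clear() {
        String closers = (italic ? "*" : "") + (bold ? "**" : "");
        bold = false;
        italic = false;
        return closers;
    }

    // Hypothetical stand-in for writeParagraphEnd: close open markers,
    // then end the paragraph with a blank line (an assumed separator).
    String paragraphEnd() {
        return clear() + "\n\n";
    }

    public static void main(String[] args) {
        ParagraphEndSketch state = new ParagraphEndSketch();
        StringBuilder out = new StringBuilder();
        out.append(state.push(true, false)).append("Important");
        out.append(' ');
        out.append(state.push(true, false)).append("notice");
        out.append(state.paragraphEnd());
        out.append(state.push(false, false)).append("Plain text");
        out.append(state.paragraphEnd());
        System.out.print(out);
    }
}
```

In this sketch the bold marker opened for the first paragraph is closed by paragraphEnd(), so the second paragraph starts from a clean state.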
      • showGlyph

        protected void showGlyph(org.apache.pdfbox.util.Matrix arg0,
                                 org.apache.pdfbox.pdmodel.font.PDFont arg1,
                                 int arg2,
                                 java.lang.String arg3,
                                 org.apache.pdfbox.util.Vector arg4)
                          throws java.io.IOException
        Overrides:
        showGlyph in class org.apache.pdfbox.contentstream.PDFStreamEngine
        Throws:
        java.io.IOException
      • computeFontHeight

        protected float computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0)
                                   throws java.io.IOException
        Throws:
        java.io.IOException