Class PDFText2Markdown

Direct Known Subclasses:
FilteredText2Markdown

public class PDFText2Markdown extends PDFTextStripper
Convert PDF text to Markdown format. Each line in the PDF is converted to a corresponding Markdown paragraph. Bold and italic formatting is also applied based on font properties.
  • Constructor Details

    • PDFText2Markdown

      public PDFText2Markdown()
      Create a new PDFText2Markdown instance.
  • Method Details

    • escape

      private static String escape(String chars)
      Escape some Markdown characters.
      Parameters:
      chars - String to be escaped
      Returns:
      The escaped string.
    • appendEscaped

      private static void appendEscaped(StringBuilder builder, char character)
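
      The two private helpers above are not documented in detail, so the following is only a minimal sketch of how per-character Markdown escaping could work. The class name MarkdownEscapeSketch and the set of escaped characters in SPECIAL are assumptions made for illustration; the actual class may escape a different set of characters.

```java
// Hypothetical sketch; this is NOT the PDFText2Markdown implementation.
final class MarkdownEscapeSketch {

    // Assumed set of characters that carry meaning in Markdown.
    // The real class may escape a different set.
    private static final String SPECIAL = "\\`*_{}[]()#+-.!|<>";

    // Escape every character of the input and return the result.
    static String escape(String chars) {
        StringBuilder builder = new StringBuilder(chars.length());
        for (int i = 0; i < chars.length(); i++) {
            appendEscaped(builder, chars.charAt(i));
        }
        return builder.toString();
    }

    // Append one character, preceded by a backslash if it is special.
    static void appendEscaped(StringBuilder builder, char character) {
        if (SPECIAL.indexOf(character) >= 0) {
            builder.append('\\');
        }
        builder.append(character);
    }
}
```

      With this assumed character set, MarkdownEscapeSketch.escape("a*b") yields a\*b, while text without special characters is returned unchanged.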
    • startArticle

      protected void startArticle(boolean isLTR) throws IOException
      Write out the article separator with proper text direction information.
      Overrides:
      startArticle in class PDFTextStripper
      Parameters:
      isLTR - true if direction of text is left to right
      Throws:
      IOException - If there is an error writing to the stream.
    • endArticle

      protected void endArticle() throws IOException
      Write out the article separator.
      Overrides:
      endArticle in class PDFTextStripper
      Throws:
      IOException - If there is an error writing to the stream.
    • writeString

      protected void writeString(String text, List<TextPosition> textPositions) throws IOException
      Write a string to the output stream, maintain font state, and escape some Markdown characters. The font state is only preserved per word.
      Overrides:
      writeString in class PDFTextStripper
      Parameters:
      text - The text to write to the stream.
      textPositions - The corresponding text positions.
      Throws:
      IOException - If there is an error writing to the stream.
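
      As a rough illustration of applying font state per word, the sketch below wraps a single word in Markdown emphasis markers. Deciding bold and italic from the font name is an assumption made here for the example; the class FontStyleSketch and its methods are hypothetical, and the actual class may inspect other font properties.

```java
// Hypothetical sketch; this is NOT the PDFText2Markdown implementation.
final class FontStyleSketch {

    // Assumption: a font counts as bold if its name contains "bold".
    static boolean isBold(String fontName) {
        return fontName.toLowerCase(java.util.Locale.ROOT).contains("bold");
    }

    // Assumption: a font counts as italic if its name contains
    // "italic" or "oblique".
    static boolean isItalic(String fontName) {
        String name = fontName.toLowerCase(java.util.Locale.ROOT);
        return name.contains("italic") || name.contains("oblique");
    }

    // Wrap one word in Markdown markers. The markers are opened and
    // closed around the same word, so the style never leaks past it.
    static String styleWord(String word, String fontName) {
        String marker = (isBold(fontName) ? "**" : "")
                + (isItalic(fontName) ? "*" : "");
        return marker + word + marker;
    }
}
```

      Closing the markers at the end of each word keeps the output well-formed even when the font changes mid-line, at the cost of repeating markers for consecutive words in the same style.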
    • writeString

      protected void writeString(String chars) throws IOException
      Write a string to the output stream and escape some Markdown characters.
      Overrides:
      writeString in class PDFTextStripper
      Parameters:
      chars - String to be written to the stream.
      Throws:
      IOException - If there is an error writing to the stream.
    • writeParagraphEnd

      protected void writeParagraphEnd() throws IOException
      Write the Markdown paragraph end to the output and clear the font state.
      Overrides:
      writeParagraphEnd in class PDFTextStripper
      Throws:
      IOException - If there is an error writing to the stream.