Class BaseParser

java.lang.Object
org.apache.pdfbox.pdfparser.BaseParser
Direct Known Subclasses:
COSParser, PDFObjectStreamParser, PDFStreamParser, PDFXrefStreamParser

public abstract class BaseParser extends Object
This class contains parsing logic that is shared by all parsers.
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    protected static final int
    A
     
    private static final Charset
    ALTERNATIVE_CHARSET
     
    protected static final byte
    ASCII_CR
    ASCII code for carriage return.
    protected static final byte
    ASCII_LF
    ASCII code for line feed.
    private static final byte
    ASCII_NINE
     
    private static final byte
    ASCII_SPACE
     
    private static final byte
    ASCII_ZERO
     
    protected static final int
    B
     
    protected static final int
    D
     
    static final String
    DEF
    This is a string constant that will be used for comparisons.
    protected COSDocument
    document
    This is the document that will be parsed.
    protected static final int
    E
     
    protected static final String
    ENDOBJ_STRING
    This is a string constant that will be used for comparisons.
    protected static final String
    ENDSTREAM_STRING
    This is a string constant that will be used for comparisons.
    private static final char[]
    FALSE
    This is a string constant that will be used for comparisons.
    private static final long
    GENERATION_NUMBER_THRESHOLD
     
    protected static final int
    J
     
    private final Map<Long,COSObjectKey>
    keyCache
     
    private static final org.apache.commons.logging.Log
    LOG
    Log instance.
    protected static final int
    M
     
    (package private) static final int
    MAX_LENGTH_LONG
     
    private static final int
    MAX_RECURSION_DEPTH
     
    private static final String
    MAX_RECUSRION_MSG
     
    protected static final int
    N
     
    private static final char[]
    NULL
    This is a string constant that will be used for comparisons.
    protected static final int
    O
     
    private static final long
    OBJECT_NUMBER_THRESHOLD
     
    protected static final int
    R
     
    private int
    recursionDepth
     
    protected static final int
    S
     
    protected final RandomAccessRead
    source
    This is the stream that will be read from.
    protected static final String
    STREAM_STRING
    This is a string constant that will be used for comparisons.
    protected static final int
    T
     
    private static final char[]
    TRUE
    This is a string constant that will be used for comparisons.
    private final CharsetDecoder
    utf8Decoder
     
  • Constructor Summary

    Constructors
    Constructor
    Description
    Default constructor.
  • Method Summary

    Modifier and Type
    Method
    Description
    private int
    checkForEndOfString(int bracesParameter)
    Works around a bug in the document creator's code that caused a crash in PDFBox; the first bug had the form /Title ( (5) /Creator and was patched in one place.
    private String
    decodeBuffer(ByteArrayOutputStream buffer)
    Tries to decode the buffer content to a UTF-8 String.
    private COSBase
    getObjectFromPool(COSObjectKey key)
     
    protected COSObjectKey
    getObjectKey(long num, int gen)
    Returns the object key for the given combination of object and generation number.
    protected boolean
    isClosing()
    This will tell if the next character is a closing bracket (the end of a PDF array).
    protected boolean
    isClosing(int c)
    Deprecated.
    This unused method will be removed in 4.0.
    private boolean
    isCR(int c)
     
    protected boolean
    isDigit()
    This will tell if the next byte is a digit or not.
    protected static boolean
    isDigit(int c)
    This will tell if the given value is a digit or not.
    protected boolean
    isEndOfName(int ch)
    Determine if a character terminates a PDF name.
    protected boolean
    isEOF()
    This will tell if the end of the data is reached.
    protected boolean
    isEOL()
    This will tell if the next byte to be read is an end of line byte.
    protected boolean
    isEOL(int c)
    This will tell if the given character is an end of line byte.
    private static boolean
    isHexDigit(char ch)
     
    private boolean
    isLF(int c)
     
    protected boolean
    isSpace()
    This will tell if the next byte is a space or not.
    protected boolean
    isSpace(int c)
    This will tell if the given value is a space or not.
    protected boolean
    isWhitespace()
    This will tell if the next byte is whitespace or not.
    protected static boolean
    isWhitespace(int c)
    This will tell if a character is whitespace or not.
    protected COSArray
    parseCOSArray()
    This will parse a PDF array object.
    protected COSDictionary
    parseCOSDictionary(boolean isDirect)
    This will parse a PDF dictionary.
    private boolean
    parseCOSDictionaryNameValuePair(COSDictionary obj)
     
    private COSBase
    parseCOSDictionaryValue()
    This will parse a PDF dictionary value.
    private COSString
    parseCOSHexString()
    This will parse a PDF hex string with fail-fast semantics: parsing stops as soon as a disallowed character is found.
    protected COSName
    parseCOSName()
    This will parse a PDF name from the stream.
    private COSNumber
    parseCOSNumber()
     
    protected COSString
    parseCOSString()
    This will parse a PDF string.
    protected COSBase
    parseDirObject()
    This will parse a direct object from the stream.
    protected void
    readExpectedChar(char ec)
    Read one char and throw an exception if it is not the expected value.
    protected final void
    readExpectedString(char[] expectedString, boolean skipSpaces)
    Reads the given pattern from the source.
    protected int
    readGenerationNumber()
    This will read an integer from the stream and throw an IOException if the value exceeds the maximum object revision (i.e. is bigger than GENERATION_NUMBER_THRESHOLD).
    protected int
    readInt()
    This will read an integer from the stream.
    protected String
    readLine()
    This will read bytes until the first end of line marker occurs.
    protected long
    readLong()
    This will read a long from the stream.
    protected long
    readObjectNumber()
    This will read a long from the stream and throw an IOException if the value is negative or has more than 10 digits (i.e. is bigger than OBJECT_NUMBER_THRESHOLD).
    protected String
    readString()
    This will read the next string from the stream.
    protected String
    readString(int length)
    Deprecated.
    This unused method will be removed in 4.0.
    protected final StringBuilder
    readStringNumber()
    This method is used by readInt() and readLong() to read a token.
    private boolean
    readUntilEndOfCOSDictionary()
    Keep reading until the end of the dictionary object or the file has been hit, or until a '/' has been found.
    protected boolean
    skipLinebreak()
    Skip one line break, such as CR, LF or CRLF.
    private boolean
    skipLinebreak(int linebreak)
    Skip one line break, such as CR, LF or CRLF.
    protected void
    skipSpaces()
    This will skip all spaces and comments that are present.
    protected void
    skipWhiteSpaces()
    Skip the upcoming CRLF or LF which are supposed to follow a stream.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • LOG

      private static final org.apache.commons.logging.Log LOG
      Log instance.
    • OBJECT_NUMBER_THRESHOLD

      private static final long OBJECT_NUMBER_THRESHOLD
    • GENERATION_NUMBER_THRESHOLD

      private static final long GENERATION_NUMBER_THRESHOLD
    • MAX_LENGTH_LONG

      static final int MAX_LENGTH_LONG
    • ALTERNATIVE_CHARSET

      private static final Charset ALTERNATIVE_CHARSET
    • MAX_RECURSION_DEPTH

      private static final int MAX_RECURSION_DEPTH
    • MAX_RECUSRION_MSG

      private static final String MAX_RECUSRION_MSG
    • recursionDepth

      private int recursionDepth
    • keyCache

      private final Map<Long,COSObjectKey> keyCache
    • utf8Decoder

      private final CharsetDecoder utf8Decoder
    • E

      protected static final int E
    • N

      protected static final int N
    • D

      protected static final int D
    • S

      protected static final int S
    • T

      protected static final int T
    • R

      protected static final int R
    • A

      protected static final int A
    • M

      protected static final int M
    • O

      protected static final int O
    • B

      protected static final int B
    • J

      protected static final int J
    • DEF

      public static final String DEF
      This is a string constant that will be used for comparisons.
    • ENDOBJ_STRING

      protected static final String ENDOBJ_STRING
      This is a string constant that will be used for comparisons.
    • ENDSTREAM_STRING

      protected static final String ENDSTREAM_STRING
      This is a string constant that will be used for comparisons.
    • STREAM_STRING

      protected static final String STREAM_STRING
      This is a string constant that will be used for comparisons.
    • TRUE

      private static final char[] TRUE
      This is a string constant that will be used for comparisons.
    • FALSE

      private static final char[] FALSE
      This is a string constant that will be used for comparisons.
    • NULL

      private static final char[] NULL
      This is a string constant that will be used for comparisons.
    • ASCII_LF

      protected static final byte ASCII_LF
      ASCII code for line feed.
    • ASCII_CR

      protected static final byte ASCII_CR
      ASCII code for carriage return.
    • ASCII_ZERO

      private static final byte ASCII_ZERO
    • ASCII_NINE

      private static final byte ASCII_NINE
    • ASCII_SPACE

      private static final byte ASCII_SPACE
    • source

      protected final RandomAccessRead source
      This is the stream that will be read from.
    • document

      protected COSDocument document
      This is the document that will be parsed.
  • Constructor Details

  • Method Details

    • isHexDigit

      private static boolean isHexDigit(char ch)
    • getObjectKey

      protected COSObjectKey getObjectKey(long num, int gen)
      Returns the object key for the given combination of object and generation number. The object key from the cross reference table/stream will be reused if available. Otherwise a newly created object will be returned.
      Parameters:
      num - the given object number
      gen - the given generation number
      Returns:
      the COS object key
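      The reuse described above can be pictured as a lookup in a map such as keyCache. The following is a minimal sketch, not the PDFBox implementation: KeySketch is a hypothetical stand-in for COSObjectKey, and packing both numbers into one long (object number shifted left by 16 bits) is an assumption made only so that a Map<Long, ...> can be used.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for COSObjectKey; not the PDFBox class.
final class KeySketch {
    final long number;
    final int generation;

    KeySketch(long number, int generation) {
        this.number = number;
        this.generation = generation;
    }
}

final class KeyCacheSketch {
    // Cache keyed by a single long that combines object and generation number.
    private final Map<Long, KeySketch> keyCache = new HashMap<>();

    // Assumption: generation numbers fit in 16 bits, so the pair packs into one long.
    private static long pack(long num, int gen) {
        return (num << 16) | (gen & 0xFFFFL);
    }

    // Stands in for keys that were collected from the cross reference table/stream.
    void register(KeySketch key) {
        keyCache.put(pack(key.number, key.generation), key);
    }

    // Reuses a known key if one exists, otherwise returns a newly created one.
    KeySketch getObjectKey(long num, int gen) {
        KeySketch cached = keyCache.get(pack(num, gen));
        return cached != null ? cached : new KeySketch(num, gen);
    }
}
```

      Returning the cached instance keeps one key object per indirect object, which is the point of the reuse; the sketch deliberately does not store keys it creates on a miss.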
    • parseCOSDictionaryValue

      private COSBase parseCOSDictionaryValue() throws IOException
      This will parse a PDF dictionary value.
      Returns:
      The parsed Dictionary object.
      Throws:
      IOException - If there is an error parsing the dictionary object.
    • getObjectFromPool

      private COSBase getObjectFromPool(COSObjectKey key) throws IOException
      Throws:
      IOException
    • parseCOSDictionary

      protected COSDictionary parseCOSDictionary(boolean isDirect) throws IOException
      This will parse a PDF dictionary.
      Parameters:
      isDirect - indicates whether the dictionary to be read is a direct object
      Returns:
      The parsed dictionary, never null.
      Throws:
      IOException - If there is an error reading the stream.
    • readUntilEndOfCOSDictionary

      private boolean readUntilEndOfCOSDictionary() throws IOException
      Keep reading until the end of the dictionary object or the file has been hit, or until a '/' has been found.
      Returns:
      true if the end of the object or the file has been found; false otherwise, meaning the caller can continue to parse the dictionary at the current position.
      Throws:
      IOException - if there is a reading error.
    • parseCOSDictionaryNameValuePair

      private boolean parseCOSDictionaryNameValuePair(COSDictionary obj) throws IOException
      Throws:
      IOException
    • skipWhiteSpaces

      protected void skipWhiteSpaces() throws IOException
      Skip the upcoming CRLF or LF which are supposed to follow a stream. Trailing spaces are removed as well.
      Throws:
      IOException - if something went wrong
    • skipLinebreak

      protected boolean skipLinebreak() throws IOException
      Skip one line break, such as CR, LF or CRLF.
      Returns:
      true if a line break was found and removed.
      Throws:
      IOException - if something went wrong
    • skipLinebreak

      private boolean skipLinebreak(int linebreak) throws IOException
      Skip one line break, such as CR, LF or CRLF.
      Parameters:
      linebreak - the first character to be checked.
      Returns:
      true if a line break was found and removed.
      Throws:
      IOException - if something went wrong
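      To illustrate what "one line break" means here, the sketch below consumes CR, LF or CRLF as a single unit. It is an assumption-laden illustration: it uses a java.io.PushbackInputStream in place of the parser's source, and LinebreakSketch and streamOf are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

final class LinebreakSketch {
    private static final int CR = 13;
    private static final int LF = 10;

    // Consumes exactly one line break (CR, LF or CRLF) if one is next.
    // Returns true if a line break was found and removed.
    static boolean skipLinebreak(PushbackInputStream in) throws IOException {
        int first = in.read();
        if (first == LF) {
            return true;
        }
        if (first == CR) {
            int second = in.read();
            if (second != LF && second != -1) {
                in.unread(second); // lone CR: put the following byte back
            }
            return true;
        }
        if (first != -1) {
            in.unread(first); // not a line break: leave the stream untouched
        }
        return false;
    }

    // Helper for trying the sketch on an in-memory string.
    static PushbackInputStream streamOf(String s) {
        return new PushbackInputStream(
                new ByteArrayInputStream(s.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```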
    • checkForEndOfString

      private int checkForEndOfString(int bracesParameter) throws IOException
      This works around a bug in the document creator's code that caused a crash in PDFBox. The first bug had the form /Title ( (5) /Creator and was patched in one place. However, that patch missed the case where the number of opening and closing parentheses isn't balanced. The second bug had the form /Title (c:\) /Producer
      Parameters:
      bracesParameter - the number of braces currently open.
      Returns:
      the corrected value of the brace counter
      Throws:
      IOException
    • parseCOSString

      protected COSString parseCOSString() throws IOException
      This will parse a PDF string.
      Returns:
      The parsed PDF string.
      Throws:
      IOException - If there is an error reading from the stream.
    • parseCOSHexString

      private COSString parseCOSHexString() throws IOException
      This will parse a PDF hex string with fail-fast semantics: parsing stops as soon as a disallowed character is found. This is necessary in order to detect malformed input and be able to skip to the start of the next object. The leading '<' is assumed to have been read already.
      Returns:
      The parsed PDF string.
      Throws:
      IOException - If there is an error reading from the stream.
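      The fail-fast idea can be sketched on a plain string as follows. This is an illustration only: HexStringSketch and parseHexBody are hypothetical names, the input is the text after the leading '<', and the choice to throw on a disallowed character is an assumption (this page does not say whether the real method throws or recovers at that point). Padding an odd final digit with 0 follows the PDF hex string rules.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

final class HexStringSketch {
    // Returns the value of a hex digit, or -1 if ch is not one.
    private static int hexValue(char ch) {
        if (ch >= '0' && ch <= '9') return ch - '0';
        if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
        if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
        return -1;
    }

    private static boolean isPdfWhitespace(char ch) {
        return ch == 0 || ch == 9 || ch == 10 || ch == 12 || ch == 13 || ch == 32;
    }

    // Parses the body of a hex string such as "48656C6C6F>".
    // Whitespace between digits is ignored; any other non-hex character
    // aborts parsing immediately instead of being silently skipped.
    static byte[] parseHexBody(String body) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int pending = -1; // high nibble waiting for its partner, or -1
        for (int i = 0; i < body.length(); i++) {
            char ch = body.charAt(i);
            if (ch == '>') {
                if (pending >= 0) {
                    out.write(pending << 4); // odd digit count: final digit is padded with 0
                }
                return out.toByteArray();
            }
            if (isPdfWhitespace(ch)) {
                continue;
            }
            int digit = hexValue(ch);
            if (digit < 0) {
                throw new IOException("Disallowed character '" + ch + "' at offset " + i);
            }
            if (pending < 0) {
                pending = digit;
            } else {
                out.write((pending << 4) | digit);
                pending = -1;
            }
        }
        throw new IOException("Missing closing '>'");
    }
}
```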
    • parseCOSArray

      protected COSArray parseCOSArray() throws IOException
      This will parse a PDF array object.
      Returns:
      The parsed PDF array.
      Throws:
      IOException - If there is an error parsing the stream.
    • isEndOfName

      protected boolean isEndOfName(int ch)
      Determine if a character terminates a PDF name.
      Parameters:
      ch - The character
      Returns:
      true if the character terminates a PDF name, otherwise false.
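      As an illustration of such a check, the sketch below treats PDF white-space characters, PDF delimiter characters and end of input (-1) as name terminators. The exact set used by BaseParser is not listed on this page, so this set is an assumption based on the character classes of ISO 32000-1; NameTerminatorSketch and readName are hypothetical names.

```java
final class NameTerminatorSketch {
    // Assumed terminator set: end of input, PDF whitespace, PDF delimiters.
    static boolean isEndOfName(int ch) {
        switch (ch) {
            case -1:                                   // end of input
            case 0: case 9: case 10: case 12: case 13: case 32: // whitespace
            case '(': case ')': case '<': case '>':
            case '[': case ']': case '{': case '}':
            case '/': case '%':                        // delimiters
                return true;
            default:
                return false;
        }
    }

    // Returns the name characters starting at 'start' (just after the '/').
    static String readName(String input, int start) {
        int end = start;
        while (end < input.length() && !isEndOfName(input.charAt(end))) {
            end++;
        }
        return input.substring(start, end);
    }
}
```

      For example, readName("/Type/Catalog", 1) stops at the second '/' and yields "Type".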
    • parseCOSName

      protected COSName parseCOSName() throws IOException
      This will parse a PDF name from the stream.
      Returns:
      The parsed PDF name.
      Throws:
      IOException - If there is an error reading from the stream.
    • decodeBuffer

      private String decodeBuffer(ByteArrayOutputStream buffer) throws UnsupportedEncodingException
      Tries to decode the buffer content to a UTF-8 String. If that fails, tries the alternative encoding.
      Parameters:
      buffer - the ByteArrayOutputStream containing the bytes to decode
      Returns:
      the decoded String
      Throws:
      UnsupportedEncodingException
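      The two-step decoding can be sketched with a strict CharsetDecoder that reports malformed input, followed by a fallback. This is a sketch under assumptions: the alternative charset is not named on this page, so ISO-8859-1 is used as a placeholder, and DecodeSketch is a hypothetical class.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class DecodeSketch {
    // Assumption: ISO-8859-1 stands in for the alternative charset.
    private static final Charset ALTERNATIVE = StandardCharsets.ISO_8859_1;

    static String decode(ByteArrayOutputStream buffer) {
        byte[] bytes = buffer.toByteArray();
        // A decoder that reports invalid byte sequences instead of replacing them.
        CharsetDecoder strictUtf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strictUtf8.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: fall back to the single-byte alternative.
            return new String(bytes, ALTERNATIVE);
        }
    }
}
```

      Reporting errors rather than substituting replacement characters is what makes the fallback possible: a lenient decoder would never signal that the bytes were not UTF-8.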
    • parseDirObject

      protected COSBase parseDirObject() throws IOException
      This will parse a direct object from the stream.
      Returns:
      The parsed object.
      Throws:
      IOException - If there is an error during parsing.
    • parseCOSNumber

      private COSNumber parseCOSNumber() throws IOException
      Throws:
      IOException
    • readString

      protected String readString() throws IOException
      This will read the next string from the stream.
      Returns:
      The string that was read from the stream, never null.
      Throws:
      IOException - If there is an error reading from the stream.
    • readExpectedString

      protected final void readExpectedString(char[] expectedString, boolean skipSpaces) throws IOException
      Reads the given pattern from the source, optionally skipping whitespace before and after it.
      Parameters:
      expectedString - pattern to be skipped
      skipSpaces - if set to true, spaces before and after the string will be skipped
      Throws:
      IOException - if pattern could not be read
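      A minimal sketch of this behaviour, assuming a java.io.PushbackInputStream in place of the parser's source. ExpectedStringSketch, skipBlanks and streamOf are hypothetical names, and skipBlanks only skips PDF white-space bytes, which may be simpler than what the parser does.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

final class ExpectedStringSketch {
    // Consumes 'expected' from the stream or throws if the bytes differ.
    static void readExpectedString(PushbackInputStream in, char[] expected, boolean skipSpaces)
            throws IOException {
        if (skipSpaces) {
            skipBlanks(in);
        }
        for (char c : expected) {
            int read = in.read();
            if (read != c) {
                throw new IOException(
                        "Expected '" + new String(expected) + "' but found byte " + read);
            }
        }
        if (skipSpaces) {
            skipBlanks(in);
        }
    }

    // Simplification: only PDF white-space bytes are skipped.
    private static void skipBlanks(PushbackInputStream in) throws IOException {
        int c = in.read();
        while (c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32) {
            c = in.read();
        }
        if (c != -1) {
            in.unread(c); // first non-blank byte stays available to the caller
        }
    }

    // Helper for trying the sketch on an in-memory string.
    static PushbackInputStream streamOf(String s) {
        return new PushbackInputStream(
                new ByteArrayInputStream(s.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```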
    • readExpectedChar

      protected void readExpectedChar(char ec) throws IOException
      Read one char and throw an exception if it is not the expected value.
      Parameters:
      ec - the char value that is expected.
      Throws:
      IOException - if the read char is not the expected value or if an I/O error occurs.
    • readString

      @Deprecated protected String readString(int length) throws IOException
      Deprecated.
      This unused method will be removed in 4.0.
      This will read the next string from the stream up to a certain length.
      Parameters:
      length - The length to stop reading at.
      Returns:
      The string that was read from the stream, between 0 and length characters long.
      Throws:
      IOException - If there is an error reading from the stream.
    • isClosing

      protected boolean isClosing() throws IOException
      This will tell if the next character is a closing bracket (the end of a PDF array).
      Returns:
      true if the next byte is ']', false otherwise.
      Throws:
      IOException - If an IO error occurs.
    • isClosing

      @Deprecated protected boolean isClosing(int c)
      Deprecated.
      This unused method will be removed in 4.0.
      This will tell if the given character is a closing bracket (the end of a PDF array).
      Parameters:
      c - The character to check against the closing bracket
      Returns:
      true if the character is ']', false otherwise.
    • readLine

      protected String readLine() throws IOException
      This will read bytes until the first end of line marker occurs. NOTE: The EOL marker may consist of 1 (CR or LF) or 2 (CR and LF) bytes, which is an important detail if one wants to unread the line.
      Returns:
      The characters between the current position and the end of the line.
      Throws:
      IOException - If there is an error reading from the stream.
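      The note about the EOL marker length can be seen in a small sketch: the marker is consumed but not returned, so the number of bytes taken from the stream is the line length plus 1 or 2. The sketch assumes a java.io.PushbackInputStream in place of the parser's source; ReadLineSketch and streamOf are hypothetical names, and returning an empty string at end of input is a choice of the sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

final class ReadLineSketch {
    // Reads up to, but not including, the EOL marker and consumes the marker.
    static String readLine(PushbackInputStream in) throws IOException {
        StringBuilder line = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (c == '\n') {
                break; // one-byte marker: LF
            }
            if (c == '\r') {
                int next = in.read();
                if (next != '\n' && next != -1) {
                    in.unread(next); // one-byte marker: lone CR
                }
                break; // otherwise a two-byte marker: CR LF
            }
            line.append((char) c);
        }
        return line.toString();
    }

    // Helper for trying the sketch on an in-memory string.
    static PushbackInputStream streamOf(String s) {
        return new PushbackInputStream(
                new ByteArrayInputStream(s.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```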
    • isEOL

      protected boolean isEOL() throws IOException
      This will tell if the next byte to be read is an end of line byte.
      Returns:
      true if the next byte is 0x0A or 0x0D.
      Throws:
      IOException - If there is an error reading from the stream.
    • isEOF

      protected boolean isEOF() throws IOException
      This will tell if the end of the data is reached.
      Returns:
      true if the end of the data is reached.
      Throws:
      IOException - If there is an error reading from the stream.
    • isEOL

      protected boolean isEOL(int c)
      This will tell if the given character is an end of line byte.
      Parameters:
      c - The character to check against end of line
      Returns:
      true if the character is 0x0A or 0x0D.
    • isLF

      private boolean isLF(int c)
    • isCR

      private boolean isCR(int c)
    • isWhitespace

      protected boolean isWhitespace() throws IOException
      This will tell if the next byte is whitespace or not.
      Returns:
      true if the next byte in the stream is a whitespace character.
      Throws:
      IOException - If there is an error reading from the stream.
    • isWhitespace

      protected static boolean isWhitespace(int c)
      This will tell if a character is whitespace or not. These values are specified in table 1 (page 12) of ISO 32000-1:2008.
      Parameters:
      c - The character to check against whitespace
      Returns:
      true if the character is a whitespace character.
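      For reference, Table 1 of ISO 32000-1:2008 lists six white-space characters. A minimal sketch of a check against that table, with WhitespaceSketch as a hypothetical class name:

```java
final class WhitespaceSketch {
    // The white-space characters of ISO 32000-1:2008, Table 1:
    // NUL (0), HT (9), LF (10), FF (12), CR (13) and SP (32).
    static boolean isWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }
}
```

      Note that this set differs from Character.isWhitespace: NUL counts as white space in PDF, while vertical tab (11) does not.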
    • isSpace

      protected boolean isSpace() throws IOException
      This will tell if the next byte is a space or not.
      Returns:
      true if the next byte in the stream is a space character.
      Throws:
      IOException - If there is an error reading from the stream.
    • isSpace

      protected boolean isSpace(int c)
      This will tell if the given value is a space or not.
      Parameters:
      c - The character to check against space
      Returns:
      true if the given value is a space character.
    • isDigit

      protected boolean isDigit() throws IOException
      This will tell if the next byte is a digit or not.
      Returns:
      true if the next byte in the stream is a digit.
      Throws:
      IOException - If there is an error reading from the stream.
    • isDigit

      protected static boolean isDigit(int c)
      This will tell if the given value is a digit or not.
      Parameters:
      c - The character to be checked
      Returns:
      true if the given value is a digit.
    • skipSpaces

      protected void skipSpaces() throws IOException
      This will skip all spaces and comments that are present.
      Throws:
      IOException - If there is an error reading from the stream.
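      A sketch of skipping both white space and comments, where a PDF comment starts with '%' and runs to the end of the line. It assumes a java.io.PushbackInputStream in place of the parser's source; SkipSpacesSketch and streamOf are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

final class SkipSpacesSketch {
    private static boolean isWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    // Leaves the stream positioned at the first byte that is neither
    // white space nor part of a comment.
    static void skipSpaces(PushbackInputStream in) throws IOException {
        int c = in.read();
        while (isWhitespace(c) || c == '%') {
            if (c == '%') {
                // Discard the comment body up to the end of the line.
                c = in.read();
                while (c != -1 && c != '\n' && c != '\r') {
                    c = in.read();
                }
            } else {
                c = in.read();
            }
        }
        if (c != -1) {
            in.unread(c);
        }
    }

    // Helper for trying the sketch on an in-memory string.
    static PushbackInputStream streamOf(String s) {
        return new PushbackInputStream(
                new ByteArrayInputStream(s.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```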
    • readObjectNumber

      protected long readObjectNumber() throws IOException
      This will read a long from the stream and throw an IOException if the value is negative or has more than 10 digits (i.e. is bigger than OBJECT_NUMBER_THRESHOLD).
      Returns:
      the object number being read.
      Throws:
      IOException - if an I/O error occurs
    • readGenerationNumber

      protected int readGenerationNumber() throws IOException
      This will read an integer from the stream and throw an IOException if the value exceeds the maximum object revision (i.e. is bigger than GENERATION_NUMBER_THRESHOLD).
      Returns:
      the generation number being read.
      Throws:
      IOException - if an I/O error occurs
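      The range checks applied to object and generation numbers can be sketched as below. The threshold values are assumptions, since the private constants are not shown on this page: 10,000,000,000 is simply the first number with more than 10 digits, and 65,535 is the largest generation number the PDF cross reference format allows. The sketch reports violations as IOException, the only exception type the signatures declare. NumberLimitsSketch is a hypothetical class.

```java
import java.io.IOException;

final class NumberLimitsSketch {
    // Assumed values; the real constants are private to BaseParser.
    private static final long OBJECT_NUMBER_THRESHOLD = 10_000_000_000L;
    private static final int GENERATION_NUMBER_THRESHOLD = 65_535;

    // Rejects negative values and values with more than 10 digits.
    static long checkObjectNumber(long value) throws IOException {
        if (value < 0 || value >= OBJECT_NUMBER_THRESHOLD) {
            throw new IOException(
                    "Object number '" + value + "' is negative or has more than 10 digits");
        }
        return value;
    }

    // Rejects negative values and values above the maximum generation number.
    static int checkGenerationNumber(int value) throws IOException {
        if (value < 0 || value > GENERATION_NUMBER_THRESHOLD) {
            throw new IOException(
                    "Generation number '" + value + "' is out of range");
        }
        return value;
    }
}
```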
    • readInt

      protected int readInt() throws IOException
      This will read an integer from the stream.
      Returns:
      The integer that was read from the stream.
      Throws:
      IOException - If there is an error reading from the stream.
    • readLong

      protected long readLong() throws IOException
      This will read a long from the stream.
      Returns:
      The long that was read from the stream.
      Throws:
      IOException - If there is an error reading from the stream.
    • readStringNumber

      protected final StringBuilder readStringNumber() throws IOException
      This method is used by readInt() and readLong() to read a token. Any non-digit value is a valid delimiter.
      Returns:
      the token to parse as integer or long by the calling method.
      Throws:
      IOException - thrown by the source methods.
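      The division of labour between the token reader and its callers can be sketched as follows: the reader collects consecutive digits and stops at the first non-digit, and the caller converts the token. The sketch assumes a java.io.PushbackInputStream in place of the parser's source and leaves out any length limit; StringNumberSketch and streamOf are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

final class StringNumberSketch {
    // Collects consecutive ASCII digits. The first non-digit byte ends the
    // token and is pushed back so the caller can still read it.
    static StringBuilder readStringNumber(PushbackInputStream in) throws IOException {
        StringBuilder token = new StringBuilder();
        int c;
        while ((c = in.read()) >= '0' && c <= '9') {
            token.append((char) c);
        }
        if (c != -1) {
            in.unread(c);
        }
        return token;
    }

    // Converts the token, turning a malformed or empty token into an IOException.
    static long readLong(PushbackInputStream in) throws IOException {
        StringBuilder token = readStringNumber(in);
        try {
            return Long.parseLong(token.toString());
        } catch (NumberFormatException e) {
            throw new IOException("Expected a long but found '" + token + "'", e);
        }
    }

    // Helper for trying the sketch on an in-memory string.
    static PushbackInputStream streamOf(String s) {
        return new PushbackInputStream(
                new ByteArrayInputStream(s.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```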