Package org.apache.pdfbox.pdfparser
Class BaseParser
java.lang.Object
org.apache.pdfbox.pdfparser.BaseParser
- Direct Known Subclasses:
COSParser, PDFObjectStreamParser, PDFStreamParser, PDFXrefStreamParser
This class is used to contain parsing logic that will be used by all parsers.
-
Field Summary
Fields (Modifier and Type, Field, Description)
protected static final int A
private static final Charset ALTERNATIVE_CHARSET
protected static final byte ASCII_CR - ASCII code for carriage return.
protected static final byte ASCII_LF - ASCII code for line feed.
private static final byte ASCII_NINE
private static final byte ASCII_SPACE
private static final byte ASCII_ZERO
protected static final int B
protected static final int D
static final String DEF - This is a string constant that will be used for comparisons.
protected COSDocument document - This is the document that will be parsed.
protected static final int E
protected static final String ENDOBJ_STRING - This is a string constant that will be used for comparisons.
protected static final String ENDSTREAM_STRING - This is a string constant that will be used for comparisons.
private static final char[] FALSE - This is a string constant that will be used for comparisons.
private static final long GENERATION_NUMBER_THRESHOLD
protected static final int J
private final Map<Long, COSObjectKey> keyCache
private static final org.apache.commons.logging.Log LOG - Log instance.
protected static final int M
(package private) static final int MAX_LENGTH_LONG
private static final int MAX_RECURSION_DEPTH
private static final String MAX_RECUSRION_MSG
protected static final int N
private static final char[] NULL - This is a string constant that will be used for comparisons.
protected static final int O
private static final long OBJECT_NUMBER_THRESHOLD
protected static final int R
private int recursionDepth
protected static final int S
protected final RandomAccessRead source - This is the stream that will be read from.
protected static final String STREAM_STRING - This is a string constant that will be used for comparisons.
protected static final int T
private static final char[] TRUE - This is a string constant that will be used for comparisons.
private final CharsetDecoder utf8Decoder
-
Constructor Summary
Constructors -
Method Summary
Methods (Modifier and Type, Method, Description)
private int checkForEndOfString(int bracesParameter) - This is really a bug in the document creator's code, but it caused a crash in PDFBox. The first bug was in this format: /Title ( (5) /Creator which was patched in 1 place.
private String decodeBuffer(ByteArrayOutputStream buffer) - Tries to decode the buffer content to a UTF-8 String.
private COSBase getObjectFromPool
protected COSObjectKey getObjectKey(long num, int gen) - Returns the object key for the given combination of object and generation number.
protected boolean isClosing() - This will tell if the next character is a closing bracket (close of a PDF array).
protected boolean isClosing(int c) - Deprecated. This unused method will be removed in 4.0.
private boolean isCR(int c)
protected boolean isDigit() - This will tell if the next byte is a digit or not.
protected static boolean isDigit(int c) - This will tell if the given value is a digit or not.
protected boolean isEndOfName(int ch) - Determine if a character terminates a PDF name.
protected boolean isEOF() - This will tell if the end of the data is reached.
protected boolean isEOL() - This will tell if the next byte to be read is an end of line byte.
protected boolean isEOL(int c) - This will tell if the given byte is an end of line byte.
private static boolean isHexDigit(char ch)
private boolean isLF(int c)
protected boolean isSpace() - This will tell if the next byte is a space or not.
protected boolean isSpace(int c) - This will tell if the given value is a space or not.
protected boolean isWhitespace() - This will tell if the next byte is whitespace or not.
protected static boolean isWhitespace(int c) - This will tell if a character is whitespace or not.
protected COSArray parseCOSArray() - This will parse a PDF array object.
protected COSDictionary parseCOSDictionary(boolean isDirect) - This will parse a PDF dictionary.
private boolean parseCOSDictionaryNameValuePair
private COSBase parseCOSDictionaryValue() - This will parse a PDF dictionary value.
private COSString parseCOSHexString() - This will parse a PDF hex string with fail-fast semantics, meaning that parsing stops if a disallowed character is found.
protected COSName parseCOSName() - This will parse a PDF name from the stream.
private COSNumber parseCOSNumber
protected COSString parseCOSString() - This will parse a PDF string.
protected COSBase parseDirObject() - This will parse a directory object from the stream.
protected void readExpectedChar(char ec) - Read one char and throw an exception if it is not the expected value.
protected final void readExpectedString(char[] expectedString, boolean skipSpaces) - Reads the given pattern from source.
protected int readGenerationNumber() - This will read an integer from the stream and throw an IllegalArgumentException if the integer value exceeds the maximum object revision (i.e. is bigger than GENERATION_NUMBER_THRESHOLD).
protected int readInt() - This will read an integer from the stream.
protected String readLine() - This will read bytes until the first end of line marker occurs.
protected long readLong() - This will read a long from the stream.
protected long readObjectNumber() - This will read a long from the stream and throw an IOException if the long value is negative or has more than 10 digits (i.e. is bigger than OBJECT_NUMBER_THRESHOLD).
protected String readString() - This will read the next string from the stream.
protected String readString(int length) - Deprecated. This unused method will be removed in 4.0.
protected final StringBuilder readStringNumber() - This method is used by readInt() and readLong() to read a token.
private boolean readUntilEndOfCOSDictionary() - Keep reading until the end of the dictionary object or the file has been hit, or until a '/' has been found.
protected boolean skipLinebreak() - Skip one line break, such as CR, LF or CRLF.
private boolean skipLinebreak(int linebreak) - Skip one line break, such as CR, LF or CRLF.
protected void skipSpaces() - This will skip all spaces and comments that are present.
protected void skipWhiteSpaces() - Skip the upcoming CRLF or LF which are supposed to follow a stream.
-
Field Details
-
LOG
private static final org.apache.commons.logging.Log LOG
Log instance.
-
OBJECT_NUMBER_THRESHOLD
private static final long OBJECT_NUMBER_THRESHOLD
-
GENERATION_NUMBER_THRESHOLD
private static final long GENERATION_NUMBER_THRESHOLD
-
MAX_LENGTH_LONG
static final int MAX_LENGTH_LONG
-
ALTERNATIVE_CHARSET
-
MAX_RECURSION_DEPTH
private static final int MAX_RECURSION_DEPTH
-
MAX_RECUSRION_MSG
-
recursionDepth
private int recursionDepth
-
keyCache
-
utf8Decoder
-
E
protected static final int E
-
N
protected static final int N
-
D
protected static final int D
-
S
protected static final int S
-
T
protected static final int T
-
R
protected static final int R
-
A
protected static final int A
-
M
protected static final int M
-
O
protected static final int O
-
B
protected static final int B
-
J
protected static final int J
-
DEF
This is a string constant that will be used for comparisons.
-
ENDOBJ_STRING
This is a string constant that will be used for comparisons.
-
ENDSTREAM_STRING
This is a string constant that will be used for comparisons.
-
STREAM_STRING
This is a string constant that will be used for comparisons.
-
TRUE
private static final char[] TRUE
This is a string constant that will be used for comparisons.
-
FALSE
private static final char[] FALSE
This is a string constant that will be used for comparisons.
-
NULL
private static final char[] NULL
This is a string constant that will be used for comparisons.
-
ASCII_LF
protected static final byte ASCII_LF
ASCII code for line feed.
-
ASCII_CR
protected static final byte ASCII_CR
ASCII code for carriage return.
-
ASCII_ZERO
private static final byte ASCII_ZERO
-
ASCII_NINE
private static final byte ASCII_NINE
-
ASCII_SPACE
private static final byte ASCII_SPACE
-
source
This is the stream that will be read from.
-
document
This is the document that will be parsed.
-
-
Constructor Details
-
BaseParser
BaseParser(RandomAccessRead pdfSource)
Default constructor.
-
-
Method Details
-
isHexDigit
private static boolean isHexDigit(char ch)
-
getObjectKey
Returns the object key for the given combination of object and generation number. The object key from the cross reference table/stream will be reused if available. Otherwise a newly created object will be returned.
Parameters:
num - the given object number
gen - the given generation number
Returns:
the COS object key
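The reuse-or-create behaviour described above can be sketched in isolation. This is a minimal illustration, not the PDFBox implementation: the classes KeyCacheSketch and ObjectKey, the register method, and the way the object and generation numbers are combined into a single long map key are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

final class KeyCacheSketch {
    /** Hypothetical stand-in for COSObjectKey. */
    static final class ObjectKey {
        final long number;
        final int generation;

        ObjectKey(long number, int generation) {
            this.number = number;
            this.generation = generation;
        }
    }

    // Keys already known, e.g. from a cross reference table (assumed source).
    private final Map<Long, ObjectKey> keyCache = new HashMap<>();

    /** Makes a key available for reuse. */
    void register(ObjectKey key) {
        keyCache.put(combine(key.number, key.generation), key);
    }

    /** Returns the cached key if one exists, otherwise a newly created key. */
    ObjectKey getObjectKey(long num, int gen) {
        ObjectKey cached = keyCache.get(combine(num, gen));
        return cached != null ? cached : new ObjectKey(num, gen);
    }

    // Assumption: generation numbers fit into 16 bits, so both values
    // can be packed into one long without collisions.
    private static long combine(long num, int gen) {
        return (num << 16) | (gen & 0xFFFFL);
    }
}
```

Reusing the cached instance avoids creating a fresh key object for every reference to the same indirect object.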
-
parseCOSDictionaryValue
This will parse a PDF dictionary value.
Returns:
The parsed Dictionary object.
Throws:
IOException - If there is an error parsing the dictionary object.
-
getObjectFromPool
Throws:
IOException
-
parseCOSDictionary
This will parse a PDF dictionary.
Parameters:
isDirect - indicates whether the dictionary to be read is a direct object
Returns:
The parsed dictionary, never null.
Throws:
IOException - If there is an error reading the stream.
-
readUntilEndOfCOSDictionary
Keep reading until the end of the dictionary object or the file has been hit, or until a '/' has been found.
Returns:
true if the end of the object or the file has been found, false if not, i.e. the caller can continue to parse the dictionary at the current position.
Throws:
IOException - if there is a reading error.
-
parseCOSDictionaryNameValuePair
Throws:
IOException
-
skipWhiteSpaces
Skip the upcoming CRLF or LF which are supposed to follow a stream. Trailing spaces are removed as well.
Throws:
IOException - if something went wrong
-
skipLinebreak
Skip one line break, such as CR, LF or CRLF.
Returns:
true if a line break was found and removed.
Throws:
IOException - if something went wrong
-
skipLinebreak
Skip one line break, such as CR, LF or CRLF.
Parameters:
linebreak - the first character to be checked.
Returns:
true if a line break was found and removed.
Throws:
IOException - if something went wrong
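To illustrate what "one line break, such as CR, LF or CRLF" means in practice, here is a small sketch that works on a byte array instead of the parser's RandomAccessRead source. The class LinebreakSketch and its position() accessor are hypothetical names used only for this example.

```java
final class LinebreakSketch {
    private final byte[] data;
    private int pos;

    LinebreakSketch(byte[] data) {
        this.data = data;
    }

    /** Current read offset into the data. */
    int position() {
        return pos;
    }

    /**
     * Skips exactly one line break at the current position.
     * CR followed by LF counts as a single line break.
     */
    boolean skipLinebreak() {
        if (pos >= data.length) {
            return false;
        }
        if (data[pos] == '\r') {
            pos++;
            // Consume the LF of a CRLF pair, if present.
            if (pos < data.length && data[pos] == '\n') {
                pos++;
            }
            return true;
        }
        if (data[pos] == '\n') {
            pos++;
            return true;
        }
        return false; // not positioned at a line break, nothing consumed
    }
}
```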
-
checkForEndOfString
This is really a bug in the document creator's code, but it caused a crash in PDFBox. The first bug was in this format: /Title ( (5) /Creator which was patched in 1 place. However, that patch missed the case where the number of opening and closing parentheses isn't balanced. The second bug was in this format: /Title (c:\) /Producer
Parameters:
bracesParameter - the number of braces currently open.
Returns:
the corrected value of the brace counter
Throws:
IOException
-
parseCOSString
This will parse a PDF string.
Returns:
The parsed PDF string.
Throws:
IOException - If there is an error reading from the stream.
-
parseCOSHexString
This will parse a PDF hex string with fail-fast semantics, meaning that parsing stops if a disallowed character is found. This is necessary in order to detect malformed input and be able to skip to the next object start. We assume the starting '<' was already read.
Returns:
The parsed PDF string.
Throws:
IOException - If there is an error reading from the stream.
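A fail-fast hex string reader can be sketched as follows. This is a simplified illustration and not the PDFBox code: the class HexStringSketch is hypothetical, the input is a String that starts right after the opening '<', and the sketch simply throws on the first disallowed character. Ignoring white space and padding an odd number of digits with a trailing 0 follow the hex string rules of ISO 32000-1.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

final class HexStringSketch {
    private HexStringSketch() { }

    /** Parses the text after '<' up to and including the closing '>'. */
    static byte[] parseHexString(String afterOpeningBracket) throws IOException {
        StringBuilder digits = new StringBuilder();
        for (int i = 0; i < afterOpeningBracket.length(); i++) {
            char c = afterOpeningBracket.charAt(i);
            if (c == '>') {
                return toBytes(digits);
            }
            if (isHexDigit(c)) {
                digits.append(c);
            } else if (!isWhitespace(c)) {
                // Fail fast: stop at the first character that is not allowed.
                throw new IOException("Illegal character in hex string at offset " + i);
            }
        }
        throw new IOException("Missing closing '>' in hex string");
    }

    private static byte[] toBytes(StringBuilder digits) {
        if (digits.length() % 2 != 0) {
            digits.append('0'); // a missing final digit is read as 0
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int j = 0; j < digits.length(); j += 2) {
            out.write(Integer.parseInt(digits.substring(j, j + 2), 16));
        }
        return out.toByteArray();
    }

    private static boolean isHexDigit(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    private static boolean isWhitespace(char c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }
}
```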
-
parseCOSArray
This will parse a PDF array object.
Returns:
The parsed PDF array.
Throws:
IOException - If there is an error parsing the stream.
-
isEndOfName
protected boolean isEndOfName(int ch)
Determine if a character terminates a PDF name.
Parameters:
ch - The character
Returns:
true if the character terminates a PDF name, otherwise false.
-
parseCOSName
This will parse a PDF name from the stream.
Returns:
The parsed PDF name.
Throws:
IOException - If there is an error reading from the stream.
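One detail of name parsing that is easy to miss is the #xx escape defined for PDF names in ISO 32000-1, where a number sign followed by two hex digits stands for one byte. The sketch below decodes such escapes from a name that has already been cut out of the stream (without the leading '/'). The class NameDecodeSketch and the choice to pass a malformed '#' through unchanged are assumptions of this example, not a description of how PDFBox handles it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class NameDecodeSketch {
    private NameDecodeSketch() { }

    /** Replaces every well-formed #xx escape by the byte it encodes. */
    static String decodeName(String raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < raw.length()) {
            char ch = raw.charAt(i);
            if (ch == '#' && i + 2 < raw.length()
                    && isHexDigit(raw.charAt(i + 1)) && isHexDigit(raw.charAt(i + 2))) {
                out.write(Integer.parseInt(raw.substring(i + 1, i + 3), 16));
                i += 3;
            } else {
                // Not an escape (or a malformed one): keep the byte as it is.
                out.write(ch);
                i++;
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    private static boolean isHexDigit(char ch) {
        return (ch >= '0' && ch <= '9') || (ch >= 'a' && ch <= 'f') || (ch >= 'A' && ch <= 'F');
    }
}
```

For example, the raw name Lime#20Green decodes to "Lime Green".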
-
decodeBuffer
Tries to decode the buffer content to a UTF-8 String. If that fails, tries the alternative encoding.
Parameters:
buffer - the ByteArrayOutputStream containing the bytes to decode
Returns:
the decoded String
Throws:
UnsupportedEncodingException
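The "try UTF-8 first, then fall back" approach can be sketched with a strict CharsetDecoder. This is an illustration under stated assumptions: the class DecodeBufferSketch is hypothetical, and ISO-8859-1 is used as the fallback only because it is always available; the charset behind ALTERNATIVE_CHARSET is not stated in this documentation.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class DecodeBufferSketch {
    // Assumption: stand-in for the alternative charset.
    private static final Charset FALLBACK = StandardCharsets.ISO_8859_1;

    private DecodeBufferSketch() { }

    static String decode(ByteArrayOutputStream buffer) {
        byte[] bytes = buffer.toByteArray();
        // REPORT makes the decoder throw instead of silently inserting U+FFFD.
        CharsetDecoder strictUtf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strictUtf8.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: decode with the single-byte fallback charset.
            return new String(bytes, FALLBACK);
        }
    }
}
```

A strict decoder matters here because new String(bytes, UTF_8) never fails; it replaces bad input, so the fallback would never be reached.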
-
parseDirObject
This will parse a directory object from the stream.
Returns:
The parsed object.
Throws:
IOException - If there is an error during parsing.
-
parseCOSNumber
Throws:
IOException
-
readString
This will read the next string from the stream.
Returns:
The string that was read from the stream, never null.
Throws:
IOException - If there is an error reading from the stream.
-
readExpectedString
protected final void readExpectedString(char[] expectedString, boolean skipSpaces) throws IOException
Reads the given pattern from source, skipping whitespace at start and end if wanted.
Parameters:
expectedString - pattern to be skipped
skipSpaces - if set to true spaces before and after the string will be skipped
Throws:
IOException - if pattern could not be read
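The matching logic can be pictured with a small sketch that works on a String and an offset rather than on the source stream. ExpectedStringSketch, its return value (the offset after the match) and the exception message are assumptions made for this illustration only.

```java
import java.io.IOException;

final class ExpectedStringSketch {
    private ExpectedStringSketch() { }

    /** Matches the pattern at pos and returns the offset behind it. */
    static int readExpectedString(String input, int pos, char[] expected, boolean skipSpaces)
            throws IOException {
        if (skipSpaces) {
            pos = skipBlanks(input, pos);
        }
        for (char c : expected) {
            if (pos >= input.length() || input.charAt(pos) != c) {
                throw new IOException(
                        "Expected '" + new String(expected) + "' near offset " + pos);
            }
            pos++;
        }
        if (skipSpaces) {
            pos = skipBlanks(input, pos);
        }
        return pos;
    }

    // Skips the six PDF white-space characters.
    private static int skipBlanks(String input, int pos) {
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (c != 0 && c != 9 && c != 10 && c != 12 && c != 13 && c != 32) {
                break;
            }
            pos++;
        }
        return pos;
    }
}
```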
-
readExpectedChar
Read one char and throw an exception if it is not the expected value.
Parameters:
ec - the char value that is expected.
Throws:
IOException - if the read char is not the expected value or if an I/O error occurs.
-
readString
Deprecated. This unused method will be removed in 4.0.
This will read the next string from the stream up to a certain length.
Parameters:
length - The length to stop reading at.
Returns:
The string that was read from the stream of length 0 to length.
Throws:
IOException - If there is an error reading from the stream.
-
isClosing
This will tell if the next character is a closing bracket (close of a PDF array).
Returns:
true if the next byte is ']', false otherwise.
Throws:
IOException - If an IO error occurs.
-
isClosing
Deprecated. This unused method will be removed in 4.0.
This will tell if the given character is a closing bracket (close of a PDF array).
Parameters:
c - The character to check
Returns:
true if the character is ']', false otherwise.
-
readLine
This will read bytes until the first end of line marker occurs. NOTE: The EOL marker may consist of 1 (CR or LF) or 2 (CR and LF) bytes, which is an important detail if one wants to unread the line.
Returns:
The characters between the current position and the end of the line.
Throws:
IOException - If there is an error reading from the stream.
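The note about 1-byte and 2-byte EOL markers is easier to see with offsets. The sketch below reads lines from an in-memory String and exposes the read position, so the number of consumed bytes can be compared with the length of the returned line. ReadLineSketch is a hypothetical class; unlike a real parser it returns an empty string at the end of the data instead of signalling an error.

```java
import java.nio.charset.StandardCharsets;

final class ReadLineSketch {
    private final byte[] data;
    private int pos;

    ReadLineSketch(String text) {
        this.data = text.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Current read offset into the data. */
    int position() {
        return pos;
    }

    /** Reads up to the next EOL marker and consumes the marker as well. */
    String readLine() {
        StringBuilder line = new StringBuilder();
        while (pos < data.length) {
            int c = data[pos++] & 0xFF;
            if (c == '\n') {
                break; // 1-byte marker: LF
            }
            if (c == '\r') {
                // 1-byte marker (CR) or 2-byte marker (CR LF)
                if (pos < data.length && data[pos] == '\n') {
                    pos++;
                }
                break;
            }
            line.append((char) c);
        }
        return line.toString();
    }
}
```

To unread a line, a caller has to rewind by the line length plus the size of the marker that was actually consumed, which is why the distinction matters.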
-
isEOL
This will tell if the next byte to be read is an end of line byte.
Returns:
true if the next byte is 0x0A or 0x0D.
Throws:
IOException - If there is an error reading from the stream.
-
isEOF
This will tell if the end of the data is reached.
Returns:
true if the end of the data is reached.
Throws:
IOException - If there is an error reading from the stream.
-
isEOL
protected boolean isEOL(int c)
This will tell if the given byte is an end of line byte.
Parameters:
c - The character to check against end of line
Returns:
true if the byte is 0x0A or 0x0D.
-
isLF
private boolean isLF(int c)
-
isCR
private boolean isCR(int c)
-
isWhitespace
This will tell if the next byte is whitespace or not.
Returns:
true if the next byte in the stream is a whitespace character.
Throws:
IOException - If there is an error reading from the stream.
-
isWhitespace
protected static boolean isWhitespace(int c)
This will tell if a character is whitespace or not. These values are specified in table 1 (page 12) of ISO 32000-1:2008.
Parameters:
c - The character to check against whitespace
Returns:
true if the character is a whitespace character.
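For reference, Table 1 of ISO 32000-1:2008 lists six white-space characters: NUL, horizontal tab, line feed, form feed, carriage return and space. A check against that table could look like the following sketch; the class name PdfWhitespaceSketch is hypothetical and the code is an illustration, not a copy of the method documented here.

```java
final class PdfWhitespaceSketch {
    private PdfWhitespaceSketch() { }

    /** Returns true for the six white-space characters of ISO 32000-1, Table 1. */
    static boolean isWhitespace(int c) {
        switch (c) {
            case 0:  // NUL
            case 9:  // horizontal tab
            case 10: // line feed
            case 12: // form feed
            case 13: // carriage return
            case 32: // space
                return true;
            default:
                return false;
        }
    }
}
```

Taking an int rather than a byte lets the end-of-data value -1 be passed in directly; it is simply not white space.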
-
isSpace
This will tell if the next byte is a space or not.
Returns:
true if the next byte in the stream is a space character.
Throws:
IOException - If there is an error reading from the stream.
-
isSpace
protected boolean isSpace(int c)
This will tell if the given value is a space or not.
Parameters:
c - The character to check against space
Returns:
true if the given value is a space character.
-
isDigit
This will tell if the next byte is a digit or not.
Returns:
true if the next byte in the stream is a digit.
Throws:
IOException - If there is an error reading from the stream.
-
isDigit
protected static boolean isDigit(int c)
This will tell if the given value is a digit or not.
Parameters:
c - The character to be checked
Returns:
true if the given value is a digit.
-
skipSpaces
This will skip all spaces and comments that are present.
Throws:
IOException - If there is an error reading from the stream.
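Skipping "spaces and comments" means that a '%' starts a comment which runs to the end of the line, and that the comment is treated like white space. A minimal sketch over a byte array, with the hypothetical class SkipSpacesSketch and a plain offset in place of the stream position, might look like this:

```java
final class SkipSpacesSketch {
    private SkipSpacesSketch() { }

    /** Returns the offset of the first byte that is neither white space nor comment. */
    static int skipSpaces(byte[] data, int pos) {
        while (pos < data.length) {
            int c = data[pos] & 0xFF;
            if (c == '%') {
                // A comment extends to the end of the line; the EOL itself
                // is consumed as white space on the next loop iteration.
                while (pos < data.length && data[pos] != '\n' && data[pos] != '\r') {
                    pos++;
                }
            } else if (c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32) {
                pos++;
            } else {
                break;
            }
        }
        return pos;
    }
}
```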
-
readObjectNumber
This will read a long from the stream and throw an IOException if the long value is negative or has more than 10 digits (i.e. is bigger than OBJECT_NUMBER_THRESHOLD).
Returns:
the object number being read.
Throws:
IOException - if an I/O error occurs
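The range check described above can be written down as a short sketch. The value of OBJECT_NUMBER_THRESHOLD is not given in this documentation; the example assumes 10,000,000,000, the smallest number with more than 10 digits, and uses the hypothetical class ObjectNumberSketch.

```java
import java.io.IOException;

final class ObjectNumberSketch {
    // Assumption: smallest value with 11 digits.
    static final long OBJECT_NUMBER_THRESHOLD = 10_000_000_000L;

    private ObjectNumberSketch() { }

    /** Returns the value unchanged if it is a plausible object number. */
    static long checkObjectNumber(long value) throws IOException {
        if (value < 0 || value >= OBJECT_NUMBER_THRESHOLD) {
            throw new IOException(
                    "Object number '" + value + "' has more than 10 digits or is negative");
        }
        return value;
    }
}
```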
-
readGenerationNumber
This will read an integer from the stream and throw an IllegalArgumentException if the integer value exceeds the maximum object revision (i.e. is bigger than GENERATION_NUMBER_THRESHOLD).
Returns:
the generation number being read.
Throws:
IOException - if an I/O error occurs
-
readInt
This will read an integer from the stream.
Returns:
The integer that was read from the stream.
Throws:
IOException - If there is an error reading from the stream.
-
readLong
This will read a long from the stream.
Returns:
The long that was read from the stream.
Throws:
IOException - If there is an error reading from the stream.
-
readStringNumber
This method is used by readInt() and readLong() to read a token. Any non-digit value is a valid delimiter.
Returns:
the token to parse as integer or long by the calling method.
Throws:
IOException - thrown by the source methods.
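The division of labour between the token reader and its callers can be sketched as follows: one helper collects consecutive digits and stops at the first non-digit, and the caller turns the token into a number. NumberTokenSketch and its String-based signatures are assumptions of this example; any length limit a real parser may apply to the token is left out.

```java
import java.io.IOException;

final class NumberTokenSketch {
    private NumberTokenSketch() { }

    /** Collects ASCII digits starting at pos; any non-digit ends the token. */
    static String readDigits(String input, int pos) {
        StringBuilder token = new StringBuilder();
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (c < '0' || c > '9') {
                break;
            }
            token.append(c);
            pos++;
        }
        return token.toString();
    }

    /** Parses the digit token at pos as a long. */
    static long readLong(String input, int pos) throws IOException {
        String token = readDigits(input, pos);
        try {
            return Long.parseLong(token);
        } catch (NumberFormatException e) {
            // An empty or oversized token is reported as a parsing error.
            throw new IOException("Expected a long but found '" + token + "'", e);
        }
    }
}
```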
-