Package org.apache.pdfbox.pdfparser
Class COSParser
java.lang.Object
org.apache.pdfbox.pdfparser.BaseParser
org.apache.pdfbox.pdfparser.COSParser
- All Implemented Interfaces:
ICOSParser
- Direct Known Subclasses:
BruteForceParser, FDFParser, PDFParser
COS-Parser which first reads startxref and xref tables in order to know which objects are valid and to parse only those objects.
This class is a much enhanced version of QuickParser presented in PDFBOX-1104 by Jeremy Villalobos.
-
Field Summary
Fields (modifier and type, field, description):
private AccessPermission accessPermission
private BruteForceParser bruteForceParser
private final Map<Long, Map<COSObjectKey, COSBase>> decompressedObjects - Intermediate cache.
private static final int DEFAULT_TRAIL_BYTECOUNT - How many trailing bytes to read for EOF marker.
private PDEncryption encryption
private static final byte[] ENDOBJ
private static final byte[] ENDSTREAM
protected static final char[] EOF_MARKER - EOF-marker.
private static final String FDF_DEFAULT_VERSION
private static final String FDF_HEADER
protected long fileLen - File length.
protected boolean initialParseDone
private boolean isLenient - Whether the parser is using its auto-healing capacity.
private String keyAlias
private InputStream keyStoreInputStream
private static final org.apache.commons.logging.Log LOG
private static final long MINIMUM_SEARCH_OFFSET
protected static final char[] OBJ_MARKER - obj-marker.
private String password
private static final String PDF_DEFAULT_VERSION
private static final String PDF_HEADER
private int readTrailBytes - How many trailing bytes to read for EOF marker.
protected SecurityHandler<? extends ProtectionPolicy> securityHandler - The security handler.
private static final char[] STARTXREF
private final byte[] strmBuf
private static final int STRMBUFLEN
static final String SYSPROP_EOFLOOKUPRANGE - The range within which the %%EOF marker will be searched.
private boolean trailerWasRebuild
private static final int X
private static final char[] XREF_TABLE
protected XrefTrailerResolver xrefTrailerResolver - Collects all Xref/trailer objects and resolves them into a single object using the startxref reference.
Fields inherited from class org.apache.pdfbox.pdfparser.BaseParser:
A, ASCII_CR, ASCII_LF, B, D, DEF, document, E, ENDOBJ_STRING, ENDSTREAM_STRING, J, M, MAX_LENGTH_LONG, N, O, R, S, source, STREAM_STRING, T
-
Constructor Summary
Constructors (constructor, description):
COSParser(RandomAccessRead source) - Default constructor.
COSParser(RandomAccessRead source, String password, InputStream keyStore, String keyAlias) - Constructor for encrypted pdfs.
COSParser(RandomAccessRead source, String password, InputStream keyStore, String keyAlias, RandomAccessStreamCache.StreamCacheCreateFunction streamCacheCreateFunction) - Constructor for encrypted pdfs.
-
Method Summary
Methods (modifier and type, method, description):
private long calculateXRefFixedOffset(long objectOffset) - Try to find a fixed offset for the given xref table/stream.
protected void checkPages(COSDictionary root) - Check if all entries of the pages dictionary are present.
private int checkPagesDictionary(COSDictionary pagesDict, Set<COSObject> set)
private long checkXRefOffset(long startXRefOffset) - Check if the cross reference table/stream can be found at the current offset.
private void checkXrefOffsets() - Check the XRef table by dereferencing all objects and fixing the offset if necessary.
private boolean checkXRefStreamOffset(long startXRefOffset) - Check if the cross reference stream can be found at the current offset.
RandomAccessReadView createRandomAccessReadView(long startPosition, long streamLength) - Creates a random access read view starting at the given position with the given length.
COSBase dereferenceCOSObject(COSObject obj) - Dereference the COSBase object which is referenced by the given COSObject.
private COSObjectKey findObjectKey(COSObjectKey objectKey, long offset, Map<COSObjectKey, Long> xrefOffset) - Check if the given object can be found at the given offset.
protected AccessPermission getAccessPermission() - This will get the AccessPermission.
private BruteForceParser getBruteForceParser()
protected PDEncryption getEncryption() - This will get the encryption dictionary.
private COSNumber getLength - Returns length value referred to or defined in given object.
private Long getObjectOffset(COSObjectKey objKey, boolean requireExistingNotCompressedObj)
private long getStartxrefOffset() - Looks for and parses startxref.
private void init(RandomAccessStreamCache.StreamCacheCreateFunction streamCacheCreateFunction)
boolean isLenient() - Return true if parser is lenient.
private boolean isString(byte[] string) - Checks if the given string can be found at the current offset.
protected boolean isString(char[] string) - Checks if the given string can be found at the current offset.
protected int lastIndexOf(char[] pattern, byte[] buf, int endOff) - Searches last appearance of pattern within buffer.
protected COSStream parseCOSStream(COSDictionary dic) - This will read a COSStream from the input stream using length attribute within dictionary.
protected boolean parseFDFHeader() - Parse the header of an FDF.
private COSBase parseFileObject(Long objOffset, COSObjectKey objKey)
private boolean parseHeader(String headerMarker, String defaultVersion)
protected COSBase parseObjectDynamically(COSObjectKey objKey, boolean requireExistingNotCompressedObj) - Parse the object for the given object key.
protected COSBase parseObjectStreamObject(long objstmObjNr, COSObjectKey key) - Parse the object with the given key from the object stream with the given number.
protected boolean parsePDFHeader() - Parse the header of a PDF.
private long parseStartXref() - This will parse the startxref section from the stream.
private boolean parseTrailer() - This will parse the trailer from the stream and add it to the state.
private COSDictionary parseXref(long startXRefOffset) - Parses cross reference tables.
private long parseXrefObjStream(long objByteOffset, boolean isStandalone) - Parses an xref object stream starting with indirect object id.
protected boolean parseXrefTable(long startByteOffset) - This will parse the xref table from the stream and add it to the state. The XrefTable contents are ignored.
protected void prepareDecryption() - Prepare for decryption.
private long readUntilEndStream - This method will read through the current stream object until we find the keyword "endstream" meaning we're at the end of this object.
protected boolean resetTrailerResolver() - Indicates whether the xref trailer resolver should be reset or not.
protected COSDictionary retrieveTrailer() - Read the trailer information and provide a COSDictionary containing the trailer information.
void setEOFLookupRange(int byteCount) - Sets how many trailing bytes of PDF file are searched for EOF marker and 'startxref' marker.
protected void setLenient(boolean lenient) - Change the parser leniency flag.
private boolean validateStreamLength(long streamLength)
private boolean validateXrefOffsets(Map<COSObjectKey, Long> xrefOffset)
Methods inherited from class org.apache.pdfbox.pdfparser.BaseParser:
getObjectKey, isClosing, isClosing, isDigit, isDigit, isEndOfName, isEOF, isEOL, isEOL, isSpace, isSpace, isWhitespace, isWhitespace, parseCOSArray, parseCOSDictionary, parseCOSName, parseCOSString, parseDirObject, readExpectedChar, readExpectedString, readGenerationNumber, readInt, readLine, readLong, readObjectNumber, readString, readString, readStringNumber, skipSpaces, skipWhiteSpaces
-
Field Details
-
PDF_HEADER
private static final String PDF_HEADER
-
FDF_HEADER
private static final String FDF_HEADER
-
PDF_DEFAULT_VERSION
private static final String PDF_DEFAULT_VERSION
-
FDF_DEFAULT_VERSION
private static final String FDF_DEFAULT_VERSION
-
XREF_TABLE
private static final char[] XREF_TABLE
-
STARTXREF
private static final char[] STARTXREF
-
ENDSTREAM
private static final byte[] ENDSTREAM
-
ENDOBJ
private static final byte[] ENDOBJ
-
MINIMUM_SEARCH_OFFSET
private static final long MINIMUM_SEARCH_OFFSET
-
X
private static final int X
-
STRMBUFLEN
private static final int STRMBUFLEN
-
strmBuf
private final byte[] strmBuf
-
accessPermission
private AccessPermission accessPermission
-
keyStoreInputStream
private InputStream keyStoreInputStream
-
password
private String password
-
keyAlias
private String keyAlias
-
SYSPROP_EOFLOOKUPRANGE
public static final String SYSPROP_EOFLOOKUPRANGE
The range within which the %%EOF marker will be searched. Useful if there are additional characters after %%EOF within the PDF.
-
DEFAULT_TRAIL_BYTECOUNT
private static final int DEFAULT_TRAIL_BYTECOUNT
How many trailing bytes to read for EOF marker.
-
EOF_MARKER
protected static final char[] EOF_MARKER
EOF-marker.
-
OBJ_MARKER
protected static final char[] OBJ_MARKER
obj-marker.
-
fileLen
protected long fileLen
File length.
-
isLenient
private boolean isLenient
Whether the parser is using its auto-healing capacity.
-
initialParseDone
protected boolean initialParseDone
-
trailerWasRebuild
private boolean trailerWasRebuild
-
bruteForceParser
private BruteForceParser bruteForceParser
-
encryption
private PDEncryption encryption
-
decompressedObjects
private final Map<Long, Map<COSObjectKey, COSBase>> decompressedObjects
Intermediate cache. Contains all objects of already read compressed object streams. Objects are removed after dereferencing them.
-
securityHandler
protected SecurityHandler<? extends ProtectionPolicy> securityHandler
The security handler.
-
readTrailBytes
private int readTrailBytes
How many trailing bytes to read for EOF marker.
-
LOG
private static final org.apache.commons.logging.Log LOG
-
xrefTrailerResolver
protected XrefTrailerResolver xrefTrailerResolver
Collects all Xref/trailer objects and resolves them into a single object using the startxref reference.
-
-
Constructor Details
-
COSParser
public COSParser(RandomAccessRead source) throws IOException
Default constructor.
- Parameters:
source - input representing the pdf.
- Throws:
IOException - if something went wrong
-
COSParser
public COSParser(RandomAccessRead source, String password, InputStream keyStore, String keyAlias) throws IOException
Constructor for encrypted pdfs.
- Parameters:
source - input representing the pdf.
password - password to be used for decryption.
keyStore - key store to be used for decryption when using public key security
keyAlias - alias to be used for decryption when using public key security
- Throws:
IOException - if the source data could not be read
-
COSParser
public COSParser(RandomAccessRead source, String password, InputStream keyStore, String keyAlias, RandomAccessStreamCache.StreamCacheCreateFunction streamCacheCreateFunction) throws IOException
Constructor for encrypted pdfs.
- Parameters:
source - input representing the pdf.
password - password to be used for decryption.
keyStore - key store to be used for decryption when using public key security
keyAlias - alias to be used for decryption when using public key security
streamCacheCreateFunction - a function to create an instance of the stream cache
- Throws:
IOException - if the source data could not be read
-
-
Method Details
-
init
private void init(RandomAccessStreamCache.StreamCacheCreateFunction streamCacheCreateFunction)
-
setEOFLookupRange
public void setEOFLookupRange(int byteCount)
Sets how many trailing bytes of the PDF file are searched for the EOF marker and the 'startxref' marker. If not set, the default value DEFAULT_TRAIL_BYTECOUNT is used.
The new value is checked to be at least 16. For practical use cases, however, it should not be lower than 1000; even 2000 was found to be insufficient in some cases where trailing garbage such as HTML snippets followed the EOF marker.
If the system property SYSPROP_EOFLOOKUPRANGE is defined, its value is applied on initialization but can be overwritten later.
- Parameters:
byteCount - number of trailing bytes
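The rule described for setEOFLookupRange can be sketched in isolation. This is a minimal illustration, not the class's actual code: the class name, the default of 2048, the property key and the choice to silently ignore values below 16 are all assumptions made for the example.

```java
/** Sketch of the lookup-range setting; every name in this class is hypothetical. */
final class EofLookupRangeSketch {
    // Assumed default for illustration; the value of DEFAULT_TRAIL_BYTECOUNT is not stated here.
    static final int ASSUMED_DEFAULT = 2048;
    // Hypothetical property key standing in for the value of SYSPROP_EOFLOOKUPRANGE.
    static final String ASSUMED_PROPERTY = "example.eofLookupRange";

    private int readTrailBytes = ASSUMED_DEFAULT;

    EofLookupRangeSketch() {
        // Apply the system property on initialization if it holds a usable integer.
        String configured = System.getProperty(ASSUMED_PROPERTY);
        if (configured != null) {
            try {
                setEOFLookupRange(Integer.parseInt(configured.trim()));
            } catch (NumberFormatException e) {
                // Malformed values are ignored and the default is kept (assumption).
            }
        }
    }

    /** Accepts only values of at least 16; smaller values leave the setting unchanged. */
    void setEOFLookupRange(int byteCount) {
        if (byteCount >= 16) {
            readTrailBytes = byteCount;
        }
    }

    int getEOFLookupRange() {
        return readTrailBytes;
    }
}
```

A caller that expects trailing garbage after %%EOF would raise the range before parsing, for example `sketch.setEOFLookupRange(4096)`.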
-
retrieveTrailer
protected COSDictionary retrieveTrailer() throws IOException
Read the trailer information and provide a COSDictionary containing the trailer information.
- Returns:
- a COSDictionary containing the trailer information
- Throws:
IOException - if something went wrong
-
resetTrailerResolver
protected boolean resetTrailerResolver()
Indicates whether the xref trailer resolver should be reset or not. Should be overridden if the xref trailer resolver is needed after the initial parsing.
- Returns:
- true if the xref trailer resolver should be reset
-
parseXref
private COSDictionary parseXref(long startXRefOffset) throws IOException
Parses cross reference tables.
- Parameters:
startXRefOffset - start offset of the first table
- Returns:
- the trailer dictionary
- Throws:
IOException - if something went wrong
-
parseXrefObjStream
private long parseXrefObjStream(long objByteOffset, boolean isStandalone) throws IOException
Parses an xref object stream starting with indirect object id.
- Returns:
- value of PREV item in dictionary or -1 if no such item exists
- Throws:
IOException
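The -1 return value suggests a loop that follows PREV entries from the newest cross reference section back to the oldest. The sketch below only illustrates that idea under assumptions: the class name, the `parseSection` callback and the guard against cyclic PREV chains are hypothetical and not taken from COSParser.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.LongUnaryOperator;

/** Hypothetical helper: walks a chain of xref sections linked by their PREV offsets. */
final class PrevChainSketch {
    /**
     * parseSection stands in for "parse the section at this offset and return
     * its PREV value, or -1 if there is none".
     * Returns the offsets in the order they were visited.
     */
    static List<Long> walk(long startXRefOffset, LongUnaryOperator parseSection) {
        List<Long> order = new ArrayList<>();
        Set<Long> seen = new HashSet<>();
        long offset = startXRefOffset;
        // Stop at -1 (no PREV item) or when an offset repeats (cyclic chain, assumed guard).
        while (offset > -1 && seen.add(offset)) {
            order.add(offset);
            offset = parseSection.applyAsLong(offset);
        }
        return order;
    }
}
```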
-
getStartxrefOffset
private long getStartxrefOffset() throws IOException
Looks for and parses startxref. We first look for the last '%%EOF' marker (within the last DEFAULT_TRAIL_BYTECOUNT bytes, or the range set via setEOFLookupRange(int)) and then go back to find startxref.
- Returns:
- the offset of StartXref
- Throws:
IOException - If something went wrong.
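The two-step search described above can be sketched on an in-memory byte array. This is an illustration under assumptions, not the parser's implementation: the class and method names are hypothetical, the whole file is assumed to fit in memory, and returning -1 for a missing marker is a choice made for the example.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that locates the startxref keyword near the end of a file. */
final class StartxrefLocatorSketch {
    /**
     * Returns the absolute offset of the last "startxref" keyword that precedes the
     * last "%%EOF" marker within the final trailBytes bytes, or -1 if either is missing.
     */
    static long findStartxref(byte[] file, int trailBytes) {
        int tailLen = Math.max(0, Math.min(trailBytes, file.length));
        int tailStart = file.length - tailLen;
        // ISO-8859-1 maps each byte to exactly one char, so string indexes equal byte offsets.
        String tail = new String(file, tailStart, tailLen, StandardCharsets.ISO_8859_1);
        int eof = tail.lastIndexOf("%%EOF");
        if (eof < 0) {
            return -1;
        }
        // Search backwards from the EOF marker for the keyword.
        int startxref = tail.lastIndexOf("startxref", eof);
        if (startxref < 0) {
            return -1;
        }
        return tailStart + (long) startxref;
    }
}
```

With a lookup range that is too small, trailing garbage after %%EOF pushes the marker out of the searched tail and the search fails, which is the situation setEOFLookupRange(int) is meant to address.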
-
lastIndexOf
protected int lastIndexOf(char[] pattern, byte[] buf, int endOff)
Searches for the last appearance of pattern within buffer. The lookup starts before endOff and goes back until 0.
- Parameters:
pattern - pattern to search for
buf - buffer to search pattern in
endOff - offset (exclusive) where lookup starts at
- Returns:
- start offset of pattern within buffer or -1 if pattern could not be found
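A straightforward way to implement a backward search with this contract is shown below. It is a naive sketch, assuming single-byte characters in the pattern; the actual method may use a different, more optimized scan, and the class name is hypothetical.

```java
/** Hypothetical stand-alone version of a backward pattern search. */
final class LastIndexOfSketch {
    /**
     * Returns the start offset of the last occurrence of pattern that ends
     * before endOff (exclusive), or -1 if there is none.
     */
    static int lastIndexOf(char[] pattern, byte[] buf, int endOff) {
        if (pattern.length == 0) {
            return -1; // assumption: an empty pattern is treated as "not found"
        }
        int end = Math.min(endOff, buf.length);
        // Try every candidate start position, from the rightmost one down to 0.
        for (int start = end - pattern.length; start >= 0; start--) {
            int i = 0;
            while (i < pattern.length && buf[start + i] == (byte) pattern[i]) {
                i++;
            }
            if (i == pattern.length) {
                return start;
            }
        }
        return -1;
    }
}
```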
-
isLenient
public boolean isLenient()
Return true if parser is lenient, meaning the auto-healing capabilities of the parser are used.
- Returns:
- true if parser is lenient
-
setLenient
protected void setLenient(boolean lenient)
Change the parser leniency flag. This method can only be called before the parsing of the file.
- Parameters:
lenient - try to handle malformed PDFs.
-
dereferenceCOSObject
public COSBase dereferenceCOSObject(COSObject obj) throws IOException
Description copied from interface: ICOSParser
Dereference the COSBase object which is referenced by the given COSObject.
- Specified by:
dereferenceCOSObject in interface ICOSParser
- Parameters:
obj - the COSObject which references the COSBase object to be dereferenced.
- Returns:
- the referenced object
- Throws:
IOException - if something went wrong when dereferencing the COSBase object
-
createRandomAccessReadView
public RandomAccessReadView createRandomAccessReadView(long startPosition, long streamLength) throws IOException
Description copied from interface: ICOSParser
Creates a random access read view starting at the given position with the given length.
- Specified by:
createRandomAccessReadView in interface ICOSParser
- Parameters:
startPosition - start position within the underlying random access read
streamLength - stream length
- Returns:
- the random access read view
- Throws:
IOException - if something went wrong when creating the view for the RandomAccessRead
-
parseObjectDynamically
protected COSBase parseObjectDynamically(COSObjectKey objKey, boolean requireExistingNotCompressedObj) throws IOException
Parse the object for the given object key.
- Parameters:
objKey - key of object to be parsed
requireExistingNotCompressedObj - if true, the object to be parsed must be defined in xref (comment: null objects may be missing from xref) and it must not be a compressed object within an object stream (this is used to avoid getting stuck in a loop in a malicious PDF)
- Returns:
- the parsed object (which is also added to document object)
- Throws:
IOException - If an IO error occurs.
-
getObjectOffset
private Long getObjectOffset(COSObjectKey objKey, boolean requireExistingNotCompressedObj) throws IOException
- Throws:
IOException
-
parseFileObject
private COSBase parseFileObject(Long objOffset, COSObjectKey objKey) throws IOException
- Throws:
IOException
-
parseObjectStreamObject
protected COSBase parseObjectStreamObject(long objstmObjNr, COSObjectKey key) throws IOException
Parse the object with the given key from the object stream with the given number.
- Parameters:
objstmObjNr - the number of the object stream
key - the key of the object to be parsed
- Returns:
- the parsed object
- Throws:
IOException - if something went wrong when parsing the object
-
getLength
Returns length value referred to or defined in given object.
- Throws:
IOException
-
parseCOSStream
protected COSStream parseCOSStream(COSDictionary dic) throws IOException
This will read a COSStream from the input stream using the length attribute within the dictionary. If the length attribute is an indirect reference, it is first resolved to get the stream length. This means we copy stream data without testing for 'endstream' or 'endobj', and thus it is no problem if these keywords occur within the stream. We require 'endstream' to be found after the stream data is read.
- Parameters:
dic - dictionary that goes with this stream.
- Returns:
- parsed pdf stream.
- Throws:
IOException - if an error occurred reading the stream, like problems with reading length attribute, stream does not end with 'endstream' after data read, stream too short etc.
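The length-based approach can be illustrated on a plain byte array. The sketch below is an assumption-laden stand-in: the class name, the byte-array input and the exact error messages are hypothetical, and it presumes the length has already been resolved to an integer.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative only: a byte array stands in for the parser's random access source. */
final class LengthBasedStreamSketch {
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.ISO_8859_1);

    /** Copies exactly length bytes from dataStart, then requires "endstream" after optional whitespace. */
    static byte[] readStream(byte[] file, int dataStart, int length) throws IOException {
        if (dataStart < 0 || length < 0 || length > file.length - dataStart) {
            throw new IOException("stream too short for declared length " + length);
        }
        // The payload is copied without inspection, so keywords inside it are harmless.
        byte[] data = Arrays.copyOfRange(file, dataStart, dataStart + length);
        int pos = dataStart + length;
        while (pos < file.length && isWhitespace(file[pos])) {
            pos++;
        }
        if (!startsWith(file, pos, ENDSTREAM)) {
            throw new IOException("'endstream' expected at offset " + pos);
        }
        return data;
    }

    private static boolean startsWith(byte[] buf, int pos, byte[] keyword) {
        if (pos + keyword.length > buf.length) {
            return false;
        }
        for (int i = 0; i < keyword.length; i++) {
            if (buf[pos + i] != keyword[i]) {
                return false;
            }
        }
        return true;
    }

    private static boolean isWhitespace(byte b) {
        return b == ' ' || b == '\r' || b == '\n' || b == '\t' || b == '\f' || b == 0;
    }
}
```

A payload such as `ab endstream cd` is returned intact when the declared length is correct, while a wrong length fails because the keyword is not found where it is expected.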
-
readUntilEndStream
This method will read through the current stream object until we find the keyword "endstream", meaning we're at the end of this object. Some pdf files, however, forget to write some endstream tags and just close off objects with an "endobj" tag, so we have to handle this case as well. This method is optimized using buffered IO and a reduced number of byte compare operations.
- Parameters:
out - stream we write out to.
- Throws:
IOException - if something went wrong
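The fallback scan can be sketched as a search for whichever of the two keywords comes first. This is a simplified illustration, not the buffered implementation described above: the class name and the byte-array input are hypothetical, and end-of-line bytes before the keyword are left in the payload.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical keyword scan used when no reliable stream length is available. */
final class EndStreamScanSketch {
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.ISO_8859_1);
    private static final byte[] ENDOBJ = "endobj".getBytes(StandardCharsets.ISO_8859_1);

    /** Returns the offset of the first "endstream" or "endobj" at or after dataStart, or -1. */
    static int findStreamEnd(byte[] file, int dataStart) {
        for (int pos = Math.max(dataStart, 0); pos < file.length; pos++) {
            // Cheap first-byte check: both keywords start with 'e'.
            if (file[pos] == 'e'
                    && (matchesAt(file, pos, ENDSTREAM) || matchesAt(file, pos, ENDOBJ))) {
                return pos;
            }
        }
        return -1;
    }

    private static boolean matchesAt(byte[] buf, int pos, byte[] keyword) {
        if (pos + keyword.length > buf.length) {
            return false;
        }
        for (int i = 0; i < keyword.length; i++) {
            if (buf[pos + i] != keyword[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Unlike the length-based read, this scan stops early if the payload itself contains one of the keywords.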
-
validateStreamLength
private boolean validateStreamLength(long streamLength) throws IOException
- Throws:
IOException
-
checkXRefOffset
private long checkXRefOffset(long startXRefOffset) throws IOException
Check if the cross reference table/stream can be found at the current offset.
- Parameters:
startXRefOffset
- Returns:
- the revised offset
- Throws:
IOException
-
checkXRefStreamOffset
private boolean checkXRefStreamOffset(long startXRefOffset) throws IOException
Check if the cross reference stream can be found at the current offset.
- Parameters:
startXRefOffset - the expected start offset of the XRef stream
- Returns:
- true if the cross reference stream can be found at the given offset
- Throws:
IOException - if something went wrong
-
calculateXRefFixedOffset
private long calculateXRefFixedOffset(long objectOffset) throws IOException
Try to find a fixed offset for the given xref table/stream.
- Parameters:
objectOffset - the given offset where to look at
- Returns:
- the fixed offset
- Throws:
IOException - if something went wrong
-
validateXrefOffsets
private boolean validateXrefOffsets(Map<COSObjectKey, Long> xrefOffset) throws IOException
- Throws:
IOException
-
checkXrefOffsets
private void checkXrefOffsets() throws IOException
Check the XRef table by dereferencing all objects and fixing the offset if necessary.
- Throws:
IOException - if something went wrong.
-
findObjectKey
private COSObjectKey findObjectKey(COSObjectKey objectKey, long offset, Map<COSObjectKey, Long> xrefOffset) throws IOException
Check if the given object can be found at the given offset. Returns the provided object key if everything is ok. If the generation number differs, it will be fixed and a new object key is returned.
- Parameters:
objectKey - the key of the object we are looking for
offset - the offset where to look
xrefOffset - a map with all known xref entries
- Returns:
- returns the found/fixed object key
- Throws:
IOException - if something went wrong
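The check and the generation fix can be sketched as follows. This is a simplified illustration under assumptions: `Key` is a hypothetical stand-in for COSObjectKey, the input is a byte array, the map of known xref entries is left out, and returning null when no matching object header is found is a choice made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of verifying an "N G obj" header at a given offset. */
final class ObjectKeyCheckSketch {
    /** Stand-in for COSObjectKey: object number plus generation number. */
    static final class Key {
        final long number;
        final int generation;

        Key(long number, int generation) {
            this.number = number;
            this.generation = generation;
        }
    }

    // Object number, generation number and the "obj" keyword, separated by whitespace.
    private static final Pattern OBJ_HEADER =
            Pattern.compile("(\\d{1,10})\\s+(\\d{1,5})\\s+obj");

    /**
     * Returns expected if its header is found at offset, a new key if only the
     * generation differs, or null if the object is not there.
     */
    static Key findObjectKey(Key expected, int offset, byte[] file) {
        if (offset < 0 || offset >= file.length) {
            return null;
        }
        int len = Math.min(64, file.length - offset);
        String head = new String(file, offset, len, StandardCharsets.ISO_8859_1);
        Matcher m = OBJ_HEADER.matcher(head);
        if (!m.lookingAt() || Long.parseLong(m.group(1)) != expected.number) {
            return null;
        }
        int generation = Integer.parseInt(m.group(2));
        return generation == expected.generation
                ? expected
                : new Key(expected.number, generation);
    }
}
```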
-
getBruteForceParser
private BruteForceParser getBruteForceParser() throws IOException
- Throws:
IOException
-
checkPages
protected void checkPages(COSDictionary root) throws IOException
Check if all entries of the pages dictionary are present. Those which can't be dereferenced are removed.
- Parameters:
root - the root dictionary of the pdf
- Throws:
IOException - if the page tree root is null
-
checkPagesDictionary
private int checkPagesDictionary(COSDictionary pagesDict, Set<COSObject> set)
-
parseStartXref
private long parseStartXref() throws IOException
This will parse the startxref section from the stream. The startxref value is ignored.
- Returns:
- the startxref value or -1 on parsing error
- Throws:
IOException - If an IO error occurs.
-
isString
Checks if the given string can be found at the current offset.
- Parameters:
string - the bytes of the string to look for
- Returns:
- true if the bytes are in place, false if not
- Throws:
IOException - if something went wrong
-
isString
Checks if the given string can be found at the current offset.
- Parameters:
string - the bytes of the string to look for
- Returns:
- true if the bytes are in place, false if not
- Throws:
IOException - if something went wrong
-
parseTrailer
private boolean parseTrailer() throws IOException
This will parse the trailer from the stream and add it to the state.
- Returns:
- false on parsing error
- Throws:
IOException - If an IO error occurs.
-
parsePDFHeader
protected boolean parsePDFHeader() throws IOException
Parse the header of a PDF.
- Returns:
- true if a PDF header was found
- Throws:
IOException - if something went wrong
-
parseFDFHeader
protected boolean parseFDFHeader() throws IOException
Parse the header of an FDF.
- Returns:
- true if an FDF header was found
- Throws:
IOException - if something went wrong
-
parseHeader
private boolean parseHeader(String headerMarker, String defaultVersion) throws IOException
- Throws:
IOException
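The parameter names headerMarker and defaultVersion hint at how a shared header routine could serve both PDF and FDF files. The sketch below is a guess at that shape, not the actual logic: the class name, the single-digit version pattern and the fallback to the default version when no version follows the marker are all assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of extracting a version string from a header line. */
final class HeaderSketch {
    // Assumed version shape: one digit, a dot, one digit (for example 1.7).
    private static final Pattern VERSION = Pattern.compile("\\d\\.\\d");

    /**
     * Returns the version that follows headerMarker, defaultVersion if the marker
     * is present without a readable version, or null if the marker is missing.
     */
    static String parseVersion(String line, String headerMarker, String defaultVersion) {
        int idx = line.indexOf(headerMarker);
        if (idx < 0) {
            return null; // no header found
        }
        // Leading garbage before the marker is tolerated; only the text after it matters.
        Matcher m = VERSION.matcher(line.substring(idx + headerMarker.length()));
        return m.lookingAt() ? m.group() : defaultVersion;
    }
}
```

For example, `parseVersion("%PDF-1.7", "%PDF-", "1.4")` yields `1.7`, and the same routine handles an FDF header when called with a marker such as `%FDF-`.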
-
parseXrefTable
protected boolean parseXrefTable(long startByteOffset) throws IOException
This will parse the xref table from the stream and add it to the state. The XrefTable contents are ignored.
- Parameters:
startByteOffset - the offset to start at
- Returns:
- false on parsing error
- Throws:
IOException - If an IO error occurs.
-
getEncryption
protected PDEncryption getEncryption() throws IOException
This will get the encryption dictionary. The document must be parsed before this is called.
- Returns:
- The encryption dictionary of the document that was parsed.
- Throws:
IOException - If there is an error getting the document.
-
getAccessPermission
protected AccessPermission getAccessPermission() throws IOException
This will get the AccessPermission. The document must be parsed before this is called.
- Returns:
- The access permission of the document that was parsed.
- Throws:
IOException - If there is an error getting the document.
-
prepareDecryption
protected void prepareDecryption() throws IOException
Prepare for decryption.
- Throws:
InvalidPasswordException - If the password is incorrect.
IOException - if something went wrong
-