Class COSParser

java.lang.Object
org.apache.pdfbox.pdfparser.BaseParser
org.apache.pdfbox.pdfparser.COSParser
All Implemented Interfaces:
ICOSParser
Direct Known Subclasses:
BruteForceParser, FDFParser, PDFParser

public class COSParser extends BaseParser implements ICOSParser
COS parser that first reads startxref and the xref tables to determine which objects are valid, and then parses only those objects. This class is a much enhanced version of QuickParser, presented in PDFBOX-1104 by Jeremy Villalobos.
  • Field Details

    • PDF_HEADER

      private static final String PDF_HEADER
    • FDF_HEADER

      private static final String FDF_HEADER
    • PDF_DEFAULT_VERSION

      private static final String PDF_DEFAULT_VERSION
    • FDF_DEFAULT_VERSION

      private static final String FDF_DEFAULT_VERSION
    • XREF_TABLE

      private static final char[] XREF_TABLE
    • STARTXREF

      private static final char[] STARTXREF
    • ENDSTREAM

      private static final byte[] ENDSTREAM
    • ENDOBJ

      private static final byte[] ENDOBJ
    • MINIMUM_SEARCH_OFFSET

      private static final long MINIMUM_SEARCH_OFFSET
    • X

      private static final int X
    • STRMBUFLEN

      private static final int STRMBUFLEN
    • strmBuf

      private final byte[] strmBuf
    • accessPermission

      private AccessPermission accessPermission
    • keyStoreInputStream

      private InputStream keyStoreInputStream
    • password

      private String password
    • keyAlias

      private String keyAlias
    • SYSPROP_EOFLOOKUPRANGE

      public static final String SYSPROP_EOFLOOKUPRANGE
      The range within which the %%EOF marker will be searched. Useful if there are additional characters after %%EOF within the PDF.
    • DEFAULT_TRAIL_BYTECOUNT

      private static final int DEFAULT_TRAIL_BYTECOUNT
      Default number of trailing bytes to read when searching for the EOF marker.
    • EOF_MARKER

      protected static final char[] EOF_MARKER
      EOF-marker.
    • OBJ_MARKER

      protected static final char[] OBJ_MARKER
      obj-marker.
    • fileLen

      protected long fileLen
      file length.
    • isLenient

      private boolean isLenient
      Whether the parser uses its auto-healing capabilities.
    • initialParseDone

      protected boolean initialParseDone
    • trailerWasRebuild

      private boolean trailerWasRebuild
    • bruteForceParser

      private BruteForceParser bruteForceParser
    • encryption

      private PDEncryption encryption
    • decompressedObjects

      private final Map<Long,Map<COSObjectKey,COSBase>> decompressedObjects
      Intermediate cache. Contains all objects of already read compressed object streams. Objects are removed after dereferencing them.
    • securityHandler

      protected SecurityHandler<? extends ProtectionPolicy> securityHandler
      The security handler.
    • readTrailBytes

      private int readTrailBytes
      Number of trailing bytes to read when searching for the EOF marker.
    • LOG

      private static final org.apache.commons.logging.Log LOG
    • xrefTrailerResolver

      protected XrefTrailerResolver xrefTrailerResolver
      Collects all Xref/trailer objects and resolves them into a single object using the startxref reference.
  • Constructor Details

    • COSParser

      public COSParser(RandomAccessRead source) throws IOException
      Default constructor.
      Parameters:
      source - input representing the pdf.
      Throws:
      IOException - if something went wrong
    • COSParser

      public COSParser(RandomAccessRead source, String password, InputStream keyStore, String keyAlias) throws IOException
      Constructor for encrypted pdfs.
      Parameters:
      source - input representing the pdf.
      password - password to be used for decryption.
      keyStore - key store to be used for decryption when using public key security
      keyAlias - alias to be used for decryption when using public key security
      Throws:
      IOException - if the source data could not be read
    • COSParser

      public COSParser(RandomAccessRead source, String password, InputStream keyStore, String keyAlias, RandomAccessStreamCache.StreamCacheCreateFunction streamCacheCreateFunction) throws IOException
      Constructor for encrypted pdfs.
      Parameters:
      source - input representing the pdf.
      password - password to be used for decryption.
      keyStore - key store to be used for decryption when using public key security
      keyAlias - alias to be used for decryption when using public key security
      streamCacheCreateFunction - a function to create an instance of the stream cache
      Throws:
      IOException - if the source data could not be read
  • Method Details

    • init

      private void init(RandomAccessStreamCache.StreamCacheCreateFunction streamCacheCreateFunction)
    • setEOFLookupRange

      public void setEOFLookupRange(int byteCount)
      Sets how many trailing bytes of the PDF file are searched for the EOF marker and the 'startxref' marker. If not set, the default value DEFAULT_TRAIL_BYTECOUNT is used.

      The new value is checked to be at least 16. For practical use cases, however, it should not be lower than 1000; even 2000 was found to be insufficient in some cases where trailing garbage such as HTML snippets followed the EOF marker.

      If the system property SYSPROP_EOFLOOKUPRANGE is defined, its value is applied on initialization but can be overridden later.

      Parameters:
      byteCount - number of trailing bytes
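
The minimum check and the system-property override described for setEOFLookupRange can be sketched in plain Java, without PDFBox. This is only an illustration: the class name EofLookupRangeSketch, the default of 2048 bytes, and the decision to silently ignore values below 16 are assumptions that are not stated on this page.

```java
final class EofLookupRangeSketch {
    // Assumed default; the actual value of DEFAULT_TRAIL_BYTECOUNT is not shown on this page.
    static final int ASSUMED_DEFAULT_TRAIL_BYTECOUNT = 2048;
    // Documented lower bound for the lookup range.
    static final int MINIMUM_RANGE = 16;

    private int readTrailBytes = ASSUMED_DEFAULT_TRAIL_BYTECOUNT;

    // propertyValue stands in for the value of the system property (may be null if undefined).
    EofLookupRangeSketch(String propertyValue) {
        if (propertyValue != null) {
            try {
                setEOFLookupRange(Integer.parseInt(propertyValue.trim()));
            } catch (NumberFormatException e) {
                // Not a number: keep the default (assumed behaviour).
            }
        }
    }

    // Values below the minimum are ignored here; this is an assumption of the sketch.
    void setEOFLookupRange(int byteCount) {
        if (byteCount >= MINIMUM_RANGE) {
            readTrailBytes = byteCount;
        }
    }

    int getEOFLookupRange() {
        return readTrailBytes;
    }
}
```

A caller would typically pass the result of System.getProperty for the property name and may still call setEOFLookupRange afterwards to override it.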
    • retrieveTrailer

      protected COSDictionary retrieveTrailer() throws IOException
      Read the trailer information and provide a COSDictionary containing the trailer information.
      Returns:
      a COSDictionary containing the trailer information
      Throws:
      IOException - if something went wrong
    • resetTrailerResolver

      protected boolean resetTrailerResolver()
      Indicates whether the xref trailer resolver should be reset or not. Should be overridden if the xref trailer resolver is needed after the initial parsing.
      Returns:
      true if the xref trailer resolver should be reset
    • parseXref

      private COSDictionary parseXref(long startXRefOffset) throws IOException
      Parses cross reference tables.
      Parameters:
      startXRefOffset - start offset of the first table
      Returns:
      the trailer dictionary
      Throws:
      IOException - if something went wrong
    • parseXrefObjStream

      private long parseXrefObjStream(long objByteOffset, boolean isStandalone) throws IOException
      Parses an xref object stream starting with indirect object id.
      Returns:
      value of PREV item in dictionary or -1 if no such item exists
      Throws:
      IOException
    • getStartxrefOffset

      private long getStartxrefOffset() throws IOException
      Looks for and parses startxref. First looks for the last '%%EOF' marker within the last DEFAULT_TRAIL_BYTECOUNT bytes (or the range set via setEOFLookupRange(int)), then goes back to find startxref.
      Returns:
      the offset of StartXref
      Throws:
      IOException - If something went wrong.
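
The two-step backward search described for getStartxrefOffset can be sketched as follows. This is a simplified illustration and not the PDFBox implementation: the class and method names are hypothetical, the whole file is assumed to be in memory, and lenient-mode recovery is not modeled.

```java
import java.nio.charset.StandardCharsets;

final class StartxrefSketch {
    /**
     * Returns the file offset of the last 'startxref' keyword that precedes the
     * last '%%EOF' marker within the trailing bytes, or -1 if either is missing.
     */
    static long findStartxrefKeyword(byte[] file, int trailByteCount) {
        // Only the last trailByteCount bytes of the file are inspected.
        int from = Math.max(0, file.length - trailByteCount);
        String tail = new String(file, from, file.length - from, StandardCharsets.ISO_8859_1);

        // Step 1: find the last EOF marker in the tail.
        int eof = tail.lastIndexOf("%%EOF");
        if (eof < 0) {
            return -1;
        }
        // Step 2: go back from the EOF marker to the closest 'startxref'.
        int startxref = tail.lastIndexOf("startxref", eof);
        if (startxref < 0) {
            return -1;
        }
        // Translate the position inside the tail to a position in the file.
        return (long) from + startxref;
    }
}
```

With a lookup range that is too small, trailing garbage after '%%EOF' pushes the marker out of the inspected window and the search fails, which is why larger ranges are recommended for setEOFLookupRange.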
    • lastIndexOf

      protected int lastIndexOf(char[] pattern, byte[] buf, int endOff)
      Searches for the last appearance of pattern within the buffer. The lookup starts before endOff and goes back until 0.
      Parameters:
      pattern - pattern to search for
      buf - buffer to search pattern in
      endOff - offset (exclusive) where lookup starts at
      Returns:
      start offset of pattern within buffer or -1 if pattern could not be found
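
A backward search with an exclusive end offset, as described for lastIndexOf, might look like the sketch below. The signature mirrors the documented one, but the body is an assumption written for illustration (including the clamping of endOff to the buffer length); it is not copied from PDFBox.

```java
final class LastIndexOfSketch {
    /**
     * Returns the start offset of the last occurrence of pattern that ends
     * before endOff (exclusive), or -1 if there is none.
     */
    static int lastIndexOf(char[] pattern, byte[] buf, int endOff) {
        // Clamp the end offset so the search never reads past the buffer (assumption).
        int end = Math.min(endOff, buf.length);
        // Try every candidate start position, from the latest possible one down to 0.
        for (int start = end - pattern.length; start >= 0; start--) {
            int i = 0;
            while (i < pattern.length && buf[start + i] == (byte) pattern[i]) {
                i++;
            }
            if (i == pattern.length) {
                return start;
            }
        }
        return -1;
    }
}
```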
    • isLenient

      public boolean isLenient()
      Returns true if the parser is lenient, meaning that the auto-healing capabilities of the parser are used.
      Returns:
      true if parser is lenient
    • setLenient

      protected void setLenient(boolean lenient)
      Change the parser leniency flag. This method can only be called before the parsing of the file.
      Parameters:
      lenient - try to handle malformed PDFs.
    • dereferenceCOSObject

      public COSBase dereferenceCOSObject(COSObject obj) throws IOException
      Description copied from interface: ICOSParser
      Dereference the COSBase object which is referenced by the given COSObject.
      Specified by:
      dereferenceCOSObject in interface ICOSParser
      Parameters:
      obj - the COSObject which references the COSBase object to be dereferenced.
      Returns:
      the referenced object
      Throws:
      IOException - if something went wrong when dereferencing the COSBase object
    • createRandomAccessReadView

      public RandomAccessReadView createRandomAccessReadView(long startPosition, long streamLength) throws IOException
      Description copied from interface: ICOSParser
      Creates a random access read view starting at the given position with the given length.
      Specified by:
      createRandomAccessReadView in interface ICOSParser
      Parameters:
      startPosition - start position within the underlying random access read
      streamLength - stream length
      Returns:
      the random access read view
      Throws:
      IOException - if something went wrong when creating the view for the RandomAccessRead
    • parseObjectDynamically

      protected COSBase parseObjectDynamically(COSObjectKey objKey, boolean requireExistingNotCompressedObj) throws IOException
      Parse the object for the given object key.
      Parameters:
      objKey - key of object to be parsed
      requireExistingNotCompressedObj - if true the object to be parsed must be defined in xref (comment: null objects may be missing from xref) and it must not be a compressed object within object stream (this is used to circumvent being stuck in a loop in a malicious PDF)
      Returns:
      the parsed object (which is also added to document object)
      Throws:
      IOException - If an IO error occurs.
    • getObjectOffset

      private Long getObjectOffset(COSObjectKey objKey, boolean requireExistingNotCompressedObj) throws IOException
      Throws:
      IOException
    • parseFileObject

      private COSBase parseFileObject(Long objOffset, COSObjectKey objKey) throws IOException
      Throws:
      IOException
    • parseObjectStreamObject

      protected COSBase parseObjectStreamObject(long objstmObjNr, COSObjectKey key) throws IOException
      Parse the object with the given key from the object stream with the given number.
      Parameters:
      objstmObjNr - the number of the object stream
      key - the key of the object to be parsed
      Returns:
      the parsed object
      Throws:
      IOException - if something went wrong when parsing the object
    • getLength

      private COSNumber getLength(COSBase lengthBaseObj) throws IOException
      Returns length value referred to or defined in given object.
      Throws:
      IOException
    • parseCOSStream

      protected COSStream parseCOSStream(COSDictionary dic) throws IOException
      This will read a COSStream from the input stream using the length attribute within the dictionary. If the length attribute is an indirect reference, it is first resolved to get the stream length. This means we copy stream data without testing for 'endstream' or 'endobj', and thus it is no problem if these keywords occur within the stream. We require 'endstream' to be found after the stream data is read.
      Parameters:
      dic - dictionary that goes with this stream.
      Returns:
      parsed pdf stream.
      Throws:
      IOException - if an error occurred reading the stream, like problems with reading length attribute, stream does not end with 'endstream' after data read, stream too short etc.
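
The length-driven copy described for parseCOSStream can be illustrated without PDFBox. In this sketch the class and method names are hypothetical, the length is assumed to be already resolved to an int, and a ByteBuffer stands in for the parser's input source.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class LengthBasedStreamSketch {
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.US_ASCII);

    /** Copies exactly length bytes, then requires the 'endstream' keyword. */
    static byte[] readStreamBody(ByteBuffer in, int length) throws IOException {
        if (length < 0 || in.remaining() < length) {
            throw new IOException("stream too short for declared length " + length);
        }
        // The data is copied blindly: keywords inside the data are not inspected.
        byte[] data = new byte[length];
        in.get(data);

        // Skip whitespace (typically an end-of-line) between the data and the keyword.
        while (in.hasRemaining()) {
            byte next = in.get(in.position());
            if (next != '\r' && next != '\n' && next != ' ') {
                break;
            }
            in.get();
        }
        if (in.remaining() < ENDSTREAM.length) {
            throw new IOException("missing 'endstream' after stream data");
        }
        byte[] keyword = new byte[ENDSTREAM.length];
        in.get(keyword);
        if (!Arrays.equals(keyword, ENDSTREAM)) {
            throw new IOException("expected 'endstream' after stream data");
        }
        return data;
    }
}
```

Because the copy is driven by the length alone, stream data that itself contains the text 'endstream' is read correctly, while a wrong length is detected when the keyword is not found where expected.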
    • readUntilEndStream

      private long readUntilEndStream(EndstreamFilterStream out) throws IOException
      This method will read through the current stream object until we find the keyword "endstream" meaning we're at the end of this object. Some pdf files, however, forget to write some endstream tags and just close off objects with an "endobj" tag so we have to handle this case as well. This method is optimized using buffered IO and reduced number of byte compare operations.
      Parameters:
      out - stream we write out to.
      Throws:
      IOException - if something went wrong
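
The fallback described for readUntilEndStream, where a missing 'endstream' is tolerated if 'endobj' follows, can be sketched as a simple keyword scan. The names below are hypothetical, the data is assumed to be fully buffered, and the buffered-IO optimizations and end-of-line filtering of the real method are not modeled.

```java
import java.nio.charset.StandardCharsets;

final class EndStreamScanSketch {
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ENDOBJ = "endobj".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns the offset of the first 'endstream' or 'endobj' keyword at or
     * after from, or -1 if neither occurs. Everything before the returned
     * offset would be treated as stream data.
     */
    static int findStreamEnd(byte[] buf, int from) {
        for (int i = Math.max(0, from); i < buf.length; i++) {
            // Both keywords start with 'e', so most positions are rejected with one compare.
            if (buf[i] == 'e' && (matchesAt(buf, i, ENDSTREAM) || matchesAt(buf, i, ENDOBJ))) {
                return i;
            }
        }
        return -1;
    }

    private static boolean matchesAt(byte[] buf, int pos, byte[] keyword) {
        if (pos + keyword.length > buf.length) {
            return false;
        }
        for (int k = 0; k < keyword.length; k++) {
            if (buf[pos + k] != keyword[k]) {
                return false;
            }
        }
        return true;
    }
}
```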
    • validateStreamLength

      private boolean validateStreamLength(long streamLength) throws IOException
      Throws:
      IOException
    • checkXRefOffset

      private long checkXRefOffset(long startXRefOffset) throws IOException
      Check if the cross reference table/stream can be found at the current offset.
      Parameters:
      startXRefOffset - the expected start offset of the cross reference table/stream
      Returns:
      the revised offset
      Throws:
      IOException
    • checkXRefStreamOffset

      private boolean checkXRefStreamOffset(long startXRefOffset) throws IOException
      Check if the cross reference stream can be found at the current offset.
      Parameters:
      startXRefOffset - the expected start offset of the XRef stream
      Returns:
      true if a cross reference stream was found at the given offset, false otherwise
      Throws:
      IOException - if something went wrong
    • calculateXRefFixedOffset

      private long calculateXRefFixedOffset(long objectOffset) throws IOException
      Try to find a fixed offset for the given xref table/stream.
      Parameters:
      objectOffset - the given offset where to look at
      Returns:
      the fixed offset
      Throws:
      IOException - if something went wrong
    • validateXrefOffsets

      private boolean validateXrefOffsets(Map<COSObjectKey,Long> xrefOffset) throws IOException
      Throws:
      IOException
    • checkXrefOffsets

      private void checkXrefOffsets() throws IOException
      Check the XRef table by dereferencing all objects and fixing the offset if necessary.
      Throws:
      IOException - if something went wrong.
    • findObjectKey

      private COSObjectKey findObjectKey(COSObjectKey objectKey, long offset, Map<COSObjectKey,Long> xrefOffset) throws IOException
      Check if the given object can be found at the given offset. Returns the provided object key if everything is ok. If the generation number differs it will be fixed and a new object key is returned.
      Parameters:
      objectKey - the key of object we are looking for
      offset - the offset where to look
      xrefOffset - a map with all known xref entries
      Returns:
      returns the found/fixed object key
      Throws:
      IOException - if something went wrong
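
The check described for findObjectKey amounts to reading the "number generation obj" header at the given offset and comparing it with the expected key. The sketch below illustrates that idea only: the class, the long[] return type, and the use of a regular expression are assumptions, and the xrefOffset map is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ObjectKeyCheckSketch {
    // Object header "<number> <generation> obj"; digit counts are bounded to avoid overflow.
    private static final Pattern OBJ_HEADER =
            Pattern.compile("(\\d{1,10})\\s+(\\d{1,5})\\s+obj");

    /**
     * Returns {number, generation} of the object found at offset if its number
     * matches, or null if no object with that number starts there. The returned
     * generation is the one present in the file, which may differ from the one
     * the caller expected.
     */
    static long[] findObjectKey(byte[] file, int offset, long number) {
        if (offset < 0 || offset >= file.length) {
            return null;
        }
        String text = new String(file, offset, file.length - offset, StandardCharsets.ISO_8859_1);
        Matcher m = OBJ_HEADER.matcher(text);
        // lookingAt() anchors the match at the offset itself.
        if (!m.lookingAt()) {
            return null;
        }
        if (Long.parseLong(m.group(1)) != number) {
            return null;
        }
        return new long[] {number, Long.parseLong(m.group(2))};
    }
}
```

A caller would compare the returned generation with the expected one and, if they differ, continue with the corrected key.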
    • getBruteForceParser

      private BruteForceParser getBruteForceParser() throws IOException
      Throws:
      IOException
    • checkPages

      protected void checkPages(COSDictionary root) throws IOException
      Check if all entries of the pages dictionary are present. Those which can't be dereferenced are removed.
      Parameters:
      root - the root dictionary of the pdf
      Throws:
      IOException - if the page tree root is null
    • checkPagesDictionary

      private int checkPagesDictionary(COSDictionary pagesDict, Set<COSObject> set)
    • parseStartXref

      private long parseStartXref() throws IOException
      This will parse the startxref section from the stream.
      Returns:
      the startxref value or -1 on parsing error
      Throws:
      IOException - If an IO error occurs.
    • isString

      private boolean isString(byte[] string) throws IOException
      Checks if the given string can be found at the current offset.
      Parameters:
      string - the bytes of the string to look for
      Returns:
      true if the bytes are in place, false if not
      Throws:
      IOException - if something went wrong
    • isString

      protected boolean isString(char[] string) throws IOException
      Checks if the given string can be found at the current offset.
      Parameters:
      string - the characters of the string to look for
      Returns:
      true if the characters are in place, false if not
      Throws:
      IOException - if something went wrong
    • parseTrailer

      private boolean parseTrailer() throws IOException
      This will parse the trailer from the stream and add it to the state.
      Returns:
      false on parsing error
      Throws:
      IOException - If an IO error occurs.
    • parsePDFHeader

      protected boolean parsePDFHeader() throws IOException
      Parse the header of a pdf.
      Returns:
      true if a PDF header was found
      Throws:
      IOException - if something went wrong
    • parseFDFHeader

      protected boolean parseFDFHeader() throws IOException
      Parse the header of a fdf.
      Returns:
      true if a FDF header was found
      Throws:
      IOException - if something went wrong
    • parseHeader

      private boolean parseHeader(String headerMarker, String defaultVersion) throws IOException
      Throws:
      IOException
    • parseXrefTable

      protected boolean parseXrefTable(long startByteOffset) throws IOException
      This will parse the xref table from the stream and add it to the state. The XrefTable contents are ignored.
      Parameters:
      startByteOffset - the offset to start at
      Returns:
      false on parsing error
      Throws:
      IOException - If an IO error occurs.
    • getEncryption

      protected PDEncryption getEncryption() throws IOException
      This will get the encryption dictionary. The document must be parsed before this is called.
      Returns:
      The encryption dictionary of the document that was parsed.
      Throws:
      IOException - If there is an error getting the document.
    • getAccessPermission

      protected AccessPermission getAccessPermission() throws IOException
      This will get the AccessPermission. The document must be parsed before this is called.
      Returns:
      The access permission of document that was parsed.
      Throws:
      IOException - If there is an error getting the document.
    • prepareDecryption

      protected void prepareDecryption() throws IOException
      Prepare for decryption.
      Throws:
      InvalidPasswordException - If the password is incorrect.
      IOException - if something went wrong