Class LucenePDFDocument


  • public class LucenePDFDocument
    extends java.lang.Object
    This class creates a document for the Lucene search engine. It should plug easily into the IndexPDFFiles class that comes with the Lucene project. The class populates the following fields:
    Lucene Field Name - Description
    path - File system path if loaded from a file
    url - URL to PDF document
    contents - Entire contents of PDF document, indexed but not stored
    summary - First 500 characters of content
    modified - The modified date/time according to the url or path
    uid - A unique identifier for the Lucene document
    CreationDate - From PDF meta-data if available
    Creator - From PDF meta-data if available
    Keywords - From PDF meta-data if available
    ModificationDate - From PDF meta-data if available
    Producer - From PDF meta-data if available
    Subject - From PDF meta-data if available
    Trapped - From PDF meta-data if available
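
    The summary field is described as the first 500 characters of the content. The following is a minimal JDK-only sketch of that truncation rule, not the class's actual code: the names SummarySketch, SUMMARY_LENGTH and summarize are hypothetical, and the only fact taken from this page is the 500-character limit.

```java
final class SummarySketch {
    // Limit taken from the "summary" field description; the constant name is hypothetical.
    static final int SUMMARY_LENGTH = 500;

    /** Returns at most the first SUMMARY_LENGTH characters of the extracted text. */
    public static String summarize(String contents) {
        // Clamp the end index so that short documents are returned unchanged.
        int end = Math.min(contents.length(), SUMMARY_LENGTH);
        return contents.substring(0, end);
    }

    public static void main(String[] args) {
        String shortText = "A short PDF body.";
        String longText = "x".repeat(1200);
        // Short text is kept whole; long text is cut at the limit.
        System.out.println(summarize(shortText));
        System.out.println(summarize(longText).length());
    }
}
```

    Clamping with Math.min avoids a StringIndexOutOfBoundsException for documents shorter than the limit, including empty ones.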
    • Constructor Summary

      Constructors 
      Constructor Description
      LucenePDFDocument()
      Constructor.
    • Method Summary

      Modifier and Type Method Description
      private void addContent​(org.apache.lucene.document.Document document, RandomAccessRead source, java.lang.String documentLocation)
      This will add the contents to the Lucene document.
      private void addKeywordField​(org.apache.lucene.document.Document document, java.lang.String name, java.lang.String value)  
      private void addTextField​(org.apache.lucene.document.Document document, java.lang.String name, java.io.Reader value)  
      private void addTextField​(org.apache.lucene.document.Document document, java.lang.String name, java.lang.String value)  
      private void addTextField​(org.apache.lucene.document.Document document, java.lang.String name, java.util.Calendar value)  
      private void addTextField​(org.apache.lucene.document.Document document, java.lang.String name, java.util.Date value)  
      private static void addUnindexedField​(org.apache.lucene.document.Document document, java.lang.String name, java.lang.String value)  
      private void addUnstoredKeywordField​(org.apache.lucene.document.Document document, java.lang.String name, java.lang.String value)  
      org.apache.lucene.document.Document convertDocument​(java.io.File file)
      This will take a reference to a PDF document and create a Lucene document.
      org.apache.lucene.document.Document convertDocument​(java.net.URL url)
      Convert the document from a PDF to a Lucene document.
      static java.lang.String createUID​(java.io.File file)
      Create a UID for the given file.
      static java.lang.String createUID​(java.net.URL url, long time)
      Create a UID for the given URL using the given time.
      static org.apache.lucene.document.Document getDocument​(java.io.File file)
      This will get a Lucene document from a PDF file.
      static org.apache.lucene.document.Document getDocument​(java.net.URL url)
      This will get a Lucene document from a PDF at a URL.
      void setTextStripper​(PDFTextStripper aStripper)
      Set the text stripper that will be used during extraction.
      private static java.lang.String timeToString​(long time)  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • FILE_SEPARATOR

        private static final char FILE_SEPARATOR
      • DATE_TIME_RES

        private static final org.apache.lucene.document.DateTools.Resolution DATE_TIME_RES
      • TYPE_STORED_NOT_INDEXED

        public static final org.apache.lucene.document.FieldType TYPE_STORED_NOT_INDEXED
        Not indexed; tokenized and stored.
    • Constructor Detail

      • LucenePDFDocument

        public LucenePDFDocument()
        Constructor.
    • Method Detail

      • setTextStripper

        public void setTextStripper​(PDFTextStripper aStripper)
        Set the text stripper that will be used during extraction.
        Parameters:
        aStripper - The new PDF text stripper.
      • timeToString

        private static java.lang.String timeToString​(long time)
      • addKeywordField

        private void addKeywordField​(org.apache.lucene.document.Document document,
                                     java.lang.String name,
                                     java.lang.String value)
      • addTextField

        private void addTextField​(org.apache.lucene.document.Document document,
                                  java.lang.String name,
                                  java.io.Reader value)
      • addTextField

        private void addTextField​(org.apache.lucene.document.Document document,
                                  java.lang.String name,
                                  java.lang.String value)
      • addTextField

        private void addTextField​(org.apache.lucene.document.Document document,
                                  java.lang.String name,
                                  java.util.Date value)
      • addTextField

        private void addTextField​(org.apache.lucene.document.Document document,
                                  java.lang.String name,
                                  java.util.Calendar value)
      • addUnindexedField

        private static void addUnindexedField​(org.apache.lucene.document.Document document,
                                              java.lang.String name,
                                              java.lang.String value)
      • addUnstoredKeywordField

        private void addUnstoredKeywordField​(org.apache.lucene.document.Document document,
                                             java.lang.String name,
                                             java.lang.String value)
      • convertDocument

        public org.apache.lucene.document.Document convertDocument​(java.io.File file)
                                                            throws java.io.IOException
        This will take a reference to a PDF document and create a Lucene document.
        Parameters:
        file - A reference to a PDF document.
        Returns:
        The converted Lucene document.
        Throws:
        java.io.IOException - If there is an exception while converting the document.
      • convertDocument

        public org.apache.lucene.document.Document convertDocument​(java.net.URL url)
                                                            throws java.io.IOException
        Convert the document from a PDF to a Lucene document.
        Parameters:
        url - A URL to a PDF document.
        Returns:
        The PDF converted to a Lucene document.
        Throws:
        java.io.IOException - If there is an error while converting the document.
      • getDocument

        public static org.apache.lucene.document.Document getDocument​(java.io.File file)
                                                               throws java.io.IOException
        This will get a Lucene document from a PDF file.
        Parameters:
        file - The file to get the document for.
        Returns:
        The Lucene document.
        Throws:
        java.io.IOException - If there is an error parsing or indexing the document.
      • getDocument

        public static org.apache.lucene.document.Document getDocument​(java.net.URL url)
                                                               throws java.io.IOException
        This will get a Lucene document from a PDF at a URL.
        Parameters:
        url - The URL of the PDF to get the document for.
        Returns:
        The Lucene document.
        Throws:
        java.io.IOException - If there is an error parsing or indexing the document.
      • addContent

        private void addContent​(org.apache.lucene.document.Document document,
                                RandomAccessRead source,
                                java.lang.String documentLocation)
                         throws java.io.IOException
        This will add the contents to the Lucene document.
        Parameters:
        document - The document to add the contents to.
        source - The source to get the content from.
        documentLocation - The location of the document, used just for debug messages.
        Throws:
        java.io.IOException - If there is an error parsing the document.
      • createUID

        public static java.lang.String createUID​(java.net.URL url,
                                                 long time)
        Create a UID for the given URL using the given time.
        Parameters:
        url - the URL to create a UID for
        time - the time to use for the UID
        Returns:
        the created UID
      • createUID

        public static java.lang.String createUID​(java.io.File file)
        Create a UID for the given file.
        Parameters:
        file - the file to create a UID for
        Returns:
        the created UID
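
        The two createUID overloads derive an identifier from a document location and a time. This page does not specify the resulting format, so the following is only a JDK-only sketch of one way such a UID could be built. The class UidSketch, the '|' separator, the second-resolution UTC timestamp pattern, and the use of the file's last-modified time in the File overload are all assumptions made for illustration.

```java
import java.io.File;
import java.net.URI;
import java.net.URL;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

final class UidSketch {
    // Hypothetical separator between location and timestamp.
    private static final char SEPARATOR = '|';

    // Assumed timestamp format: UTC, truncated to whole seconds.
    private static final DateTimeFormatter SECONDS_UTC =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss").withZone(ZoneOffset.UTC);

    /** Formats epoch milliseconds as a sortable UTC string. */
    public static String timeToString(long epochMillis) {
        return SECONDS_UTC.format(Instant.ofEpochMilli(epochMillis));
    }

    /** Sketch of a UID built from a URL and an explicit time. */
    public static String createUid(URL url, long time) {
        return url.toExternalForm() + SEPARATOR + timeToString(time);
    }

    /** Sketch of a UID built from a file path and its last-modified time (an assumption). */
    public static String createUid(File file) {
        // lastModified() returns 0 if the file does not exist.
        return file.getPath() + SEPARATOR + timeToString(file.lastModified());
    }

    public static void main(String[] args) throws Exception {
        URL url = URI.create("https://example.com/docs/report.pdf").toURL();
        System.out.println(createUid(url, 0L));
        System.out.println(createUid(new File("report.pdf")));
    }
}
```

        Because the time is part of the identifier in this sketch, the same location yields a different UID after the document changes, which lets an indexer tell a stale entry from a current one.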