Class LucenePDFDocument

java.lang.Object
org.apache.pdfbox.examples.lucene.LucenePDFDocument

public class LucenePDFDocument extends Object
This class creates a document for the Lucene search engine. It should plug easily into the IndexPDFFiles that comes with the Lucene project. The class populates the following fields:
Lucene Field Name   Description
path                File system path if loaded from a file
url                 URL to PDF document
contents            Entire contents of PDF document, indexed but not stored
summary             First 500 characters of content
modified            The modified date/time according to the url or path
uid                 A unique identifier for the Lucene document.
CreationDate        From PDF meta-data if available
Creator             From PDF meta-data if available
Keywords            From PDF meta-data if available
ModificationDate    From PDF meta-data if available
Producer            From PDF meta-data if available
Subject             From PDF meta-data if available
Trapped             From PDF meta-data if available
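
To make the summary rule in the table concrete, here is a minimal sketch of taking the first 500 characters of extracted text. It is an illustration only, not the class's actual code: the class name `SummarySketch`, the method `summarize`, and the constant `SUMMARY_LENGTH` are hypothetical names introduced for this example.

```java
final class SummarySketch {
    // Hypothetical constant mirroring the "first 500 characters" rule in the table.
    static final int SUMMARY_LENGTH = 500;

    // Returns at most SUMMARY_LENGTH leading characters of the extracted text.
    static String summarize(String contents) {
        int end = Math.min(contents.length(), SUMMARY_LENGTH);
        return contents.substring(0, end);
    }

    public static void main(String[] args) {
        String longText = "x".repeat(1200);
        System.out.println(summarize(longText).length()); // prints 500
        System.out.println(summarize("short text"));      // prints "short text"
    }
}
```

Clamping the end index with `Math.min` keeps short documents from raising a `StringIndexOutOfBoundsException`.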
  • Field Details

    • FILE_SEPARATOR

      private static final char FILE_SEPARATOR
    • DATE_TIME_RES

      private static final org.apache.lucene.document.DateTools.Resolution DATE_TIME_RES
    • stripper

      private PDFTextStripper stripper
    • TYPE_STORED_NOT_INDEXED

      public static final org.apache.lucene.document.FieldType TYPE_STORED_NOT_INDEXED
      Field type that is not indexed, but is tokenized and stored.
  • Constructor Details

    • LucenePDFDocument

      public LucenePDFDocument()
      Constructor.
  • Method Details

    • setTextStripper

      public void setTextStripper(PDFTextStripper aStripper)
      Set the text stripper that will be used during extraction.
      Parameters:
      aStripper - The new PDF text stripper.
    • timeToString

      private static String timeToString(long time)
    • addKeywordField

      private void addKeywordField(org.apache.lucene.document.Document document, String name, String value)
    • addTextField

      private void addTextField(org.apache.lucene.document.Document document, String name, Reader value)
    • addTextField

      private void addTextField(org.apache.lucene.document.Document document, String name, String value)
    • addTextField

      private void addTextField(org.apache.lucene.document.Document document, String name, Date value)
    • addTextField

      private void addTextField(org.apache.lucene.document.Document document, String name, Calendar value)
    • addUnindexedField

      private static void addUnindexedField(org.apache.lucene.document.Document document, String name, String value)
    • addUnstoredKeywordField

      private void addUnstoredKeywordField(org.apache.lucene.document.Document document, String name, String value)
    • convertDocument

      public org.apache.lucene.document.Document convertDocument(File file) throws IOException
      Takes a reference to a PDF document and creates a Lucene document from it.
      Parameters:
      file - A reference to a PDF document.
      Returns:
      The converted Lucene document.
      Throws:
      IOException - If there is an exception while converting the document.
    • convertDocument

      public org.apache.lucene.document.Document convertDocument(URL url) throws IOException
      Converts the document from a PDF to a Lucene document.
      Parameters:
      url - A URL to a PDF document.
      Returns:
      The PDF converted to a Lucene document.
      Throws:
      IOException - If there is an error while converting the document.
    • getDocument

      public static org.apache.lucene.document.Document getDocument(File file) throws IOException
      Gets a Lucene document from a PDF file.
      Parameters:
      file - The file to get the document for.
      Returns:
      The Lucene document.
      Throws:
      IOException - If there is an error parsing or indexing the document.
    • getDocument

      public static org.apache.lucene.document.Document getDocument(URL url) throws IOException
      Gets a Lucene document from a PDF at a URL.
      Parameters:
      url - The URL of the PDF to get the document for.
      Returns:
      The Lucene document.
      Throws:
      IOException - If there is an error parsing or indexing the document.
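
The conversion methods above can be combined roughly as follows. This is a usage sketch, not an official example: it assumes the PDFBox examples artifact and Lucene are on the classpath, that `PDFTextStripper` is the class from `org.apache.pdfbox.text`, and that the PDF path arrives as the first command-line argument. The class name `LucenePdfUsageSketch` and the choice of `setSortByPosition(true)` are illustrative only.

```java
import java.io.File;
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.pdfbox.examples.lucene.LucenePDFDocument;
import org.apache.pdfbox.text.PDFTextStripper;

final class LucenePdfUsageSketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical input: path to a PDF passed on the command line.
        File pdf = new File(args[0]);

        // One-shot conversion through the static helper.
        Document quick = LucenePDFDocument.getDocument(pdf);
        // Document.get returns null when a field is absent or not stored.
        System.out.println("path:    " + quick.get("path"));
        System.out.println("summary: " + quick.get("summary"));

        // Reusable converter configured with a custom text stripper.
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setSortByPosition(true); // example setting, not required
        LucenePDFDocument converter = new LucenePDFDocument();
        converter.setTextStripper(stripper);
        Document custom = converter.convertDocument(pdf);
        System.out.println("summary: " + custom.get("summary"));
    }
}
```

The static `getDocument` suits single conversions, while an instance with its own stripper is the better fit when extraction settings need to differ from the defaults.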
    • addContent

      private void addContent(org.apache.lucene.document.Document document, RandomAccessRead source, String documentLocation) throws IOException
      Adds the contents to the Lucene document.
      Parameters:
      document - The document to add the contents to.
      source - The source to get the content from.
      documentLocation - The location of the document, used just for debug messages.
      Throws:
      IOException - If there is an error parsing the document.
    • createUID

      public static String createUID(URL url, long time)
      Creates a UID for the given URL using the given time.
      Parameters:
      url - the URL to create a UID for
      time - the time to use for the UID
      Returns:
      the created UID
    • createUID

      public static String createUID(File file)
      Creates a UID for the given file.
      Parameters:
      file - the file to create a UID for
      Returns:
      the created UID