Class LucenePDFDocument
- java.lang.Object
-
- org.apache.pdfbox.examples.lucene.LucenePDFDocument
-
public class LucenePDFDocument extends java.lang.Object

This class is used to create a document for the Lucene search engine. This should easily plug into the IndexPDFFiles that comes with the Lucene project. This class will populate the following fields.

Lucene Field Name - Description
path - File system path, if loaded from a file
url - URL to the PDF document
contents - Entire contents of the PDF document, indexed but not stored
summary - First 500 characters of the content
modified - The modified date/time according to the URL or path
uid - A unique identifier for the Lucene document
CreationDate - From PDF metadata, if available
Creator - From PDF metadata, if available
Keywords - From PDF metadata, if available
ModificationDate - From PDF metadata, if available
Producer - From PDF metadata, if available
Subject - From PDF metadata, if available
Trapped - From PDF metadata, if available
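The summary field in the list above holds the first 500 characters of the extracted text. The following is a minimal sketch of that truncation rule using only the standard library. The class name `SummarySketch` and the method `summarize` are hypothetical names introduced for illustration; this is not the PDFBox implementation, and the assumption that shorter texts are kept whole is ours.

```java
/** Hypothetical helper for illustration; not part of PDFBox. */
final class SummarySketch {
    // Length taken from the field description: "First 500 characters of content".
    static final int SUMMARY_LENGTH = 500;

    /** Returns at most the first SUMMARY_LENGTH characters of the given text. */
    static String summarize(String contents) {
        // Assumption: text shorter than the limit is returned unchanged.
        int end = Math.min(contents.length(), SUMMARY_LENGTH);
        return contents.substring(0, end);
    }

    public static void main(String[] args) {
        // Build a text longer than the limit to show the truncation.
        StringBuilder longText = new StringBuilder();
        for (int i = 0; i < 1200; i++) {
            longText.append('x');
        }
        System.out.println(summarize(longText.toString()).length());
        System.out.println(summarize("short text"));
    }
}
```

Taking the minimum of the text length and the limit avoids a `StringIndexOutOfBoundsException` for documents with fewer than 500 characters of text.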
-
-
Field Summary
Fields
Modifier and Type, Field - Description
private static org.apache.lucene.document.DateTools.Resolution DATE_TIME_RES
private static char FILE_SEPARATOR
private PDFTextStripper stripper
static org.apache.lucene.document.FieldType TYPE_STORED_NOT_INDEXED - Not indexed, tokenized, stored.
-
Constructor Summary
Constructors
Constructor - Description
LucenePDFDocument() - Constructor.
-
Method Summary
Methods
Modifier and Type, Method - Description
private void addContent(org.apache.lucene.document.Document document, RandomAccessRead source, java.lang.String documentLocation) - This will add the contents to the Lucene document.
private void addKeywordField(org.apache.lucene.document.Document document, java.lang.String name, java.lang.String value)
private void addTextField(org.apache.lucene.document.Document document, java.lang.String name, java.io.Reader value)
private void addTextField(org.apache.lucene.document.Document document, java.lang.String name, java.lang.String value)
private void addTextField(org.apache.lucene.document.Document document, java.lang.String name, java.util.Calendar value)
private void addTextField(org.apache.lucene.document.Document document, java.lang.String name, java.util.Date value)
private static void addUnindexedField(org.apache.lucene.document.Document document, java.lang.String name, java.lang.String value)
private void addUnstoredKeywordField(org.apache.lucene.document.Document document, java.lang.String name, java.lang.String value)
org.apache.lucene.document.Document convertDocument(java.io.File file) - This will take a reference to a PDF document and create a Lucene document.
org.apache.lucene.document.Document convertDocument(java.net.URL url) - Convert the document from a PDF to a Lucene document.
static java.lang.String createUID(java.io.File file) - Create a UID for the given file.
static java.lang.String createUID(java.net.URL url, long time) - Create a UID for the given file using the given time.
static org.apache.lucene.document.Document getDocument(java.io.File file) - This will get a Lucene document from a PDF file.
static org.apache.lucene.document.Document getDocument(java.net.URL url) - This will get a Lucene document from a PDF file.
void setTextStripper(PDFTextStripper aStripper) - Set the text stripper that will be used during extraction.
private static java.lang.String timeToString(long time)
-
-
-
Field Detail
-
FILE_SEPARATOR
private static final char FILE_SEPARATOR
-
DATE_TIME_RES
private static final org.apache.lucene.document.DateTools.Resolution DATE_TIME_RES
-
stripper
private PDFTextStripper stripper
-
TYPE_STORED_NOT_INDEXED
public static final org.apache.lucene.document.FieldType TYPE_STORED_NOT_INDEXED
A field type that is stored and tokenized, but not indexed.
-
-
Method Detail
-
setTextStripper
public void setTextStripper(PDFTextStripper aStripper)
Set the text stripper that will be used during extraction.
Parameters:
aStripper - The new PDF text stripper.
-
timeToString
private static java.lang.String timeToString(long time)
-
addKeywordField
private void addKeywordField(org.apache.lucene.document.Document document, java.lang.String name, java.lang.String value)
-
addTextField
private void addTextField(org.apache.lucene.document.Document document, java.lang.String name, java.io.Reader value)
-
addTextField
private void addTextField(org.apache.lucene.document.Document document, java.lang.String name, java.lang.String value)
-
addTextField
private void addTextField(org.apache.lucene.document.Document document, java.lang.String name, java.util.Date value)
-
addTextField
private void addTextField(org.apache.lucene.document.Document document, java.lang.String name, java.util.Calendar value)
-
addUnindexedField
private static void addUnindexedField(org.apache.lucene.document.Document document, java.lang.String name, java.lang.String value)
-
addUnstoredKeywordField
private void addUnstoredKeywordField(org.apache.lucene.document.Document document, java.lang.String name, java.lang.String value)
-
convertDocument
public org.apache.lucene.document.Document convertDocument(java.io.File file) throws java.io.IOException
This will take a reference to a PDF document and create a Lucene document.
Parameters:
file - A reference to a PDF document.
Returns:
The converted Lucene document.
Throws:
java.io.IOException - If there is an exception while converting the document.
-
convertDocument
public org.apache.lucene.document.Document convertDocument(java.net.URL url) throws java.io.IOException
Convert the document from a PDF to a Lucene document.
Parameters:
url - A URL to a PDF document.
Returns:
The PDF converted to a Lucene document.
Throws:
java.io.IOException - If there is an error while converting the document.
-
getDocument
public static org.apache.lucene.document.Document getDocument(java.io.File file) throws java.io.IOException
This will get a Lucene document from a PDF file.
Parameters:
file - The file to get the document for.
Returns:
The Lucene document.
Throws:
java.io.IOException - If there is an error parsing or indexing the document.
-
getDocument
public static org.apache.lucene.document.Document getDocument(java.net.URL url) throws java.io.IOException
This will get a Lucene document from a PDF file located at a URL.
Parameters:
url - The URL of the PDF file to get the document for.
Returns:
The Lucene document.
Throws:
java.io.IOException - If there is an error parsing or indexing the document.
-
addContent
private void addContent(org.apache.lucene.document.Document document, RandomAccessRead source, java.lang.String documentLocation) throws java.io.IOException
This will add the contents to the Lucene document.
Parameters:
document - The document to add the contents to.
source - The source to get the content from.
documentLocation - The location of the document, used just for debug messages.
Throws:
java.io.IOException - If there is an error parsing the document.
-
createUID
public static java.lang.String createUID(java.net.URL url, long time)
Create a UID for the given file using the given time.
Parameters:
url - The URL of the file to create a UID for.
time - The time to use for the UID.
Returns:
The created UID.
-
createUID
public static java.lang.String createUID(java.io.File file)
Create a UID for the given file.
Parameters:
file - The file to create a UID for.
Returns:
The created UID.
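The two createUID overloads derive an identifier from a location (a file or a URL) and a time. The sketch below illustrates one plausible way to build such an identifier, so that a changed timestamp yields a different UID. Everything in it is an assumption made for illustration: the class `UidSketch`, the method `uidFor`, the `|` separator, and the use of the file's last-modified time are hypothetical and are not taken from the PDFBox source.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

/** Hypothetical illustration only; not the PDFBox implementation. */
final class UidSketch {
    // Assumed separator between the location and the time; an arbitrary choice.
    private static final char SEPARATOR = '|';

    /** Joins a location string and a timestamp into one identifier. */
    static String uidFor(String location, long time) {
        return location + SEPARATOR + time;
    }

    /** URL variant: the caller supplies the time, as in createUID(URL, long). */
    static String uidFor(URL url, long time) {
        return uidFor(url.toExternalForm(), time);
    }

    /** File variant: assumes the file's last-modified time serves as the time. */
    static String uidFor(File file) {
        return uidFor(file.getPath(), file.lastModified());
    }

    public static void main(String[] args) throws MalformedURLException {
        URL url = URI.create("http://example.com/report.pdf").toURL();
        // The same URL with two different times gives two different identifiers.
        System.out.println(uidFor(url, 1000L));
        System.out.println(uidFor(url, 2000L));
    }
}
```

Including the time in the identifier means an index can tell a re-modified document apart from the version it already holds, which is the usual reason for pairing a location with a timestamp.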
-
-