Class LucenePDFDocument
java.lang.Object
org.apache.pdfbox.examples.lucene.LucenePDFDocument
This class creates a document for the Lucene search engine. It should plug easily into IndexPDFFiles,
which comes with the Lucene project. The class populates the following fields:
| Lucene Field Name | Description |
|---|---|
| path | File system path if loaded from a file |
| url | URL to PDF document |
| contents | Entire contents of PDF document, indexed but not stored |
| summary | First 500 characters of content |
| modified | The modified date/time according to the url or path |
| uid | A unique identifier for the Lucene document. |
| CreationDate | From PDF meta-data if available |
| Creator | From PDF meta-data if available |
| Keywords | From PDF meta-data if available |
| ModificationDate | From PDF meta-data if available |
| Producer | From PDF meta-data if available |
| Subject | From PDF meta-data if available |
| Trapped | From PDF meta-data if available |
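The `summary` row above describes a field that holds the first 500 characters of the extracted content. A minimal sketch of that truncation, using only the JDK, might look like the following. The class name `SummarySketch`, the helper `summarize`, and the constant `SUMMARY_LENGTH` are hypothetical names chosen for illustration; the code is not taken from the PDFBox source.

```java
final class SummarySketch {
    // Assumed limit, taken from the table row "First 500 characters of content".
    static final int SUMMARY_LENGTH = 500;

    // Returns at most SUMMARY_LENGTH leading characters of the given text.
    static String summarize(String contents) {
        int end = Math.min(contents.length(), SUMMARY_LENGTH);
        return contents.substring(0, end);
    }

    public static void main(String[] args) {
        StringBuilder longText = new StringBuilder();
        for (int i = 0; i < 1200; i++) {
            longText.append('x');
        }
        System.out.println(summarize(longText.toString()).length()); // prints 500
        System.out.println(summarize("short text"));                 // prints short text
    }
}
```

Content shorter than the limit is returned unchanged, so the summary never exceeds the content itself.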
-
Field Summary
| Modifier and Type | Field | Description |
|---|---|---|
| private static final org.apache.lucene.document.DateTools.Resolution | DATE_TIME_RES | |
| private static final char | FILE_SEPARATOR | |
| private PDFTextStripper | stripper | |
| static final org.apache.lucene.document.FieldType | TYPE_STORED_NOT_INDEXED | not Indexed, tokenized, stored. |
-
Constructor Summary
| Constructor | Description |
|---|---|
| LucenePDFDocument() | Constructor. |
-
Method Summary
| Modifier and Type | Method | Description |
|---|---|---|
| private void | addContent(org.apache.lucene.document.Document document, RandomAccessRead source, String documentLocation) | This will add the contents to the Lucene document. |
| private void | addKeywordField(org.apache.lucene.document.Document document, String name, String value) | |
| private void | addTextField(org.apache.lucene.document.Document document, String name, Reader value) | |
| private void | addTextField(org.apache.lucene.document.Document document, String name, String value) | |
| private void | addTextField(org.apache.lucene.document.Document document, String name, Calendar value) | |
| private void | addTextField(org.apache.lucene.document.Document document, String name, Date value) | |
| private static void | addUnindexedField(org.apache.lucene.document.Document document, String name, String value) | |
| private void | addUnstoredKeywordField(org.apache.lucene.document.Document document, String name, String value) | |
| org.apache.lucene.document.Document | convertDocument(File file) | This will take a reference to a PDF document and create a Lucene document. |
| org.apache.lucene.document.Document | convertDocument(URL url) | Convert the document from a PDF to a Lucene document. |
| static String | createUID | Create a UID for the given file. |
| static String | createUID | Create a UID for the given file using the given time. |
| static org.apache.lucene.document.Document | getDocument(File file) | This will get a Lucene document from a PDF file. |
| static org.apache.lucene.document.Document | getDocument(URL url) | This will get a Lucene document from a PDF file. |
| void | setTextStripper(PDFTextStripper aStripper) | Set the text stripper that will be used during extraction. |
| private static String | timeToString(long time) | |
-
Field Details
-
FILE_SEPARATOR
private static final char FILE_SEPARATOR
-
DATE_TIME_RES
private static final org.apache.lucene.document.DateTools.Resolution DATE_TIME_RES
-
stripper
private PDFTextStripper stripper
-
TYPE_STORED_NOT_INDEXED
public static final org.apache.lucene.document.FieldType TYPE_STORED_NOT_INDEXED
Not indexed, tokenized, stored.
-
-
Constructor Details
-
LucenePDFDocument
public LucenePDFDocument()
Constructor.
-
-
Method Details
-
setTextStripper
Set the text stripper that will be used during extraction.
- Parameters:
  aStripper - The new PDF text stripper.
-
timeToString
-
addKeywordField
-
addTextField
-
addTextField
-
addTextField
-
addTextField
-
addUnindexedField
-
addUnstoredKeywordField
-
convertDocument
This will take a reference to a PDF document and create a Lucene document.
- Parameters:
  file - A reference to a PDF document.
- Returns:
  The converted Lucene document.
- Throws:
  IOException - If there is an exception while converting the document.
-
convertDocument
Convert the document from a PDF to a Lucene document.
- Parameters:
  url - A URL to a PDF document.
- Returns:
  The PDF converted to a Lucene document.
- Throws:
  IOException - If there is an error while converting the document.
-
getDocument
This will get a Lucene document from a PDF file.
- Parameters:
  file - The file to get the document for.
- Returns:
  The Lucene document.
- Throws:
  IOException - If there is an error parsing or indexing the document.
-
getDocument
This will get a Lucene document from a PDF file.
- Parameters:
  url - The URL of the PDF file to get the document for.
- Returns:
  The Lucene document.
- Throws:
  IOException - If there is an error parsing or indexing the document.
-
addContent
private void addContent(org.apache.lucene.document.Document document, RandomAccessRead source, String documentLocation) throws IOException
This will add the contents to the Lucene document.
- Parameters:
  document - The document to add the contents to.
  source - The source to get the content from.
  documentLocation - The location of the document, used only for debug messages.
- Throws:
  IOException - If there is an error parsing the document.
-
createUID
Create a UID for the given file using the given time.
- Parameters:
  url - the file we have to create a UID for
  time - the time to use for the UID
- Returns:
  the created UID
-
createUID
Create a UID for the given file.
- Parameters:
  file - the file we have to create a UID for
- Returns:
  the created UID
-
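The two createUID methods above derive an identifier from a file location and a time. One plausible reason to combine the two is that a modified file then yields a different UID, which can help an indexer spot stale entries. The sketch below illustrates that idea with JDK classes only. The class `UidSketch`, the method `createUid`, the NUL separator, and the `yyyyMMddHHmmss` UTC timestamp pattern are all assumptions made for this example and should not be read as the format LucenePDFDocument actually produces.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

final class UidSketch {
    // Builds "<location><NUL><timestamp>". Both the separator and the
    // timestamp pattern are assumptions chosen for this illustration.
    static String createUid(String location, long timeMillis) {
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMddHHmmss", Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return location + '\u0000' + format.format(new Date(timeMillis));
    }

    public static void main(String[] args) {
        String before = createUid("/docs/report.pdf", 0L);
        String after = createUid("/docs/report.pdf", 60_000L);
        // Same location, different time: the two identifiers differ.
        System.out.println(before.equals(after)); // prints false
    }
}
```

Because the time is part of the identifier, the same location with an unchanged time always maps to the same UID, while any change to the time produces a new one.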