Metadata-Version: 2.4
Name: pdf-utils
Version: 0.1.1
Summary: tools for reading and processing pdf content
Home-page: https://bitbucket.kendaya.net/projects/KXLAB/repos/pdf-tools/
Author: Kendaxa
Author-email: develop@kendaxa.com
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Build Tools
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: lxml~=4.5.1
Requires-Dist: numpy~=1.19.0
Requires-Dist: Pillow~=7.2.0
Requires-Dist: pdf2image~=1.13.1
Requires-Dist: PyPDF2~=1.26.0
Requires-Dist: opencv-python~=4.2.0
Requires-Dist: pytesseract~=0.3.4
Requires-Dist: reportlab~=3.5.44
Requires-Dist: scipy~=1.5.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

## Tools for processing PDF files

This is a lightweight library for processing PDF files in Python.
One possible use case is extracting PDF annotations for ML / NLP.

Supported features:

* obtaining the textual and visual content of PDF files
* locating the positions of words
* fetching PDF annotations
* adding a digital layer to image-based PDFs
* re-creating a clean PDF file with annotations removed


## Dependencies

The main tool for reading PDF files is the PyPDF2 library. The non-Python dependencies are

* [Poppler](https://poppler.freedesktop.org/),
* [Tesseract](https://tesseract-ocr.github.io/tessdoc/Home.html), and
* [OpenCV](https://opencv.org/).

To install Poppler, see the guide in the [pdf2image readme](https://pypi.org/project/pdf2image/).
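
Before running the library, it can help to confirm that the external command-line tools are reachable. The following is a minimal sketch and not part of this package: the function name `find_external_tools` is hypothetical. It assumes that Poppler is detected through its `pdftoppm` executable (which pdf2image calls) and Tesseract through the `tesseract` executable (which pytesseract calls). OpenCV is not checked here because the `opencv-python` wheel normally bundles it.

```python
import shutil


def find_external_tools(names=("pdftoppm", "tesseract")):
    """Map each executable name to its full path, or None if it is not on PATH.

    Hypothetical helper for illustration only; it is not provided by pdf-utils.
    """
    return {name: shutil.which(name) for name in names}


# Report which of the assumed executables can be found on this machine.
for tool, path in find_external_tools().items():
    status = path if path is not None else "NOT FOUND (install it or add it to PATH)"
    print(f"{tool}: {status}")
```

A `None` entry usually means the tool is either not installed or installed in a directory that is not on `PATH`.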

## How to

Some examples of usage are shown in the [notebook](./notebook/Demo.ipynb).

## Todo

* Add image-based detection of page orientation (upside-down, rotated, ...).
* Add some of our experiments with "naive" table detection.
* Get rid of PyPDF2, as [it is not maintained](https://stackoverflow.com/questions/63199763/maintained-alternatives-to-pypdf2); replace it with PyMuPDF or pdfminer.six.
