📝 Textractor

Textractor is a Python application that converts non-searchable PDFs into searchable PDFs using Optical Character Recognition (OCR) with support for both German and English text. It integrates spell-checking for extracted text using dictionaries and allows batch processing of PDF files through a user-friendly graphical interface (GUI).

🎯 Purpose

Have you ever needed to:

Copy text from a PDF file?
Search for PDF files containing specific words?
Organize your documents by their content?

Textractor solves these problems by scanning the PDF files in a specific path, extracting their text, and spell-checking it before storing new searchable versions of the files. Don't worry, PDF files that already contain selectable text are skipped!

💡 Features

📄 Converts non-searchable PDFs into searchable ones.
🔍 Integrates OCR to extract text from PDF images.
📝 Spell-checks the extracted text using German and English dictionaries.
🖼️ Supports batch processing of PDFs within folders.
🔧 Easy-to-use GUI for folder selection and processing.
📂 Allows selecting custom output directories for processed files.
🗃️ Preserves your folder structure by mirroring the layout of the source folder in the output folder.

🛠️ Installation

1. Install Tesseract OCR

A- For Windows: Download and install the latest release of the Tesseract OCR Windows installer provided by UB Mannheim.

B- For macOS: If you have Homebrew installed, you can install Tesseract by running the following command:

brew install tesseract

C- For Linux (Ubuntu/Debian): Install Tesseract via APT

sudo apt-get install tesseract-ocr

For installation on other operating systems, refer to the Tesseract OCR download page.
During installation, note the installation path. You will need to update the path in the code.

Verify the installation by running tesseract -v

2. Update the Path in Code

In the Textractor code, update the following line with the correct path to the Tesseract-OCR executable (make sure the path is correct):

pytesseract.pytesseract.tesseract_cmd = r'C:\Path\To\Tesseract-OCR\tesseract.exe'

On other operating systems (not Windows):

pytesseract.pytesseract.tesseract_cmd = 'Path/To/Tesseract-OCR/executable/'
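
Instead of hard-coding the path, the executable could be looked up on PATH first, with the hard-coded value used only as a fallback. The following is a minimal sketch of that idea and not Textractor's actual code; `resolve_executable` and `configure_pytesseract` are hypothetical helper names introduced for illustration.

```python
import shutil


def resolve_executable(name, fallback):
    """Return the full path of `name` if it is on PATH, else `fallback`."""
    found = shutil.which(name)
    return found if found is not None else fallback


def configure_pytesseract(fallback):
    """Point pytesseract at the resolved Tesseract executable (hypothetical helper)."""
    # Imported lazily so the lookup helper also works without pytesseract installed.
    import pytesseract

    pytesseract.pytesseract.tesseract_cmd = resolve_executable("tesseract", fallback)
    return pytesseract.pytesseract.tesseract_cmd
```

Calling `configure_pytesseract(r'C:\Path\To\Tesseract-OCR\tesseract.exe')` would then keep the manual path as a fallback only.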

3. Download and Install Poppler

A- For Windows:

Download the Poppler binaries for Windows published by Owen Schwartz.

Add the poppler\bin folder to your PATH environment variable.

B- For macOS:

If you have Homebrew installed, you can install Poppler by running the following command:

brew install poppler

C- For Linux (Ubuntu/Debian):

sudo apt-get install poppler-utils

Verify the installation by running pdftoppm -v

4. Install Required Python Packages

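Run the following command to install the necessary dependencies. tkinter is part of the Python standard library and cannot be installed with pip; on Debian/Ubuntu it is provided by the python3-tk package.

pip install pdf2image pytesseract pyenchant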

5. Hunspell Dictionaries Setup

To ensure the spell-check functionality works:

A. Install Hunspell Dictionaries: Download and install Hunspell dictionaries for German (de_DE) and English (en_US) languages. You can find the dictionaries here.

B. Update the Dictionary Path in the code: Update the path to the dictionaries in the following line:

enchant.set_param("enchant.myspell.dictionary.path", r'C:\Path\To\Hunspell\Dictionaries')

On other operating systems (not Windows):

enchant.set_param("enchant.myspell.dictionary.path", 'Path/To/Hunspell/Dictionaries/')
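
A Hunspell dictionary consists of a word list (.dic) and an affix file (.aff) per language. As a rough sanity check, the configured folder could be scanned for both files before the application starts. This is only a sketch: `missing_dictionary_files` is a hypothetical helper, and it assumes the files are named after the language code (for example de_DE.dic), which is the usual convention but may differ in your download.

```python
from pathlib import Path


def missing_dictionary_files(dictionary_dir, languages=("de_DE", "en_US")):
    """Return the expected Hunspell file names that are absent from dictionary_dir."""
    folder = Path(dictionary_dir)
    missing = []
    for language in languages:
        # Each language needs both the word list and the affix rules.
        for suffix in (".dic", ".aff"):
            candidate = folder / f"{language}{suffix}"
            if not candidate.is_file():
                missing.append(candidate.name)
    return missing
```

An empty list means both languages appear to be in place; any names returned point to files that still need to be copied into the folder.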

6. Update Environment PATH Variables

Ensure that Tesseract OCR, Hunspell, and Poppler paths are included in your system's PATH environment variables to allow the software to locate them during execution.
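
To confirm that the command-line tools are reachable, a short script can look them up on PATH. The sketch below is an illustration rather than part of Textractor; `missing_tools` is a hypothetical name, and it only checks the `tesseract` and `pdftoppm` commands mentioned in the verification steps above.

```python
import shutil


def missing_tools(names=("tesseract", "pdftoppm")):
    """Return the command names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external tools were found on PATH.")
```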

🚀 How to Use

1. Launch the Application

Simply run the Python script using the following command:

python Textractor.py

2. Select Input and Output Folders

A GUI will appear where you can select the folder containing the PDF files to process.
Use the "Browse" button to select both Input and Output folders:
Input Folder: Contains the non-searchable PDF files.
Output Folder: Where the processed searchable PDFs will be saved.

3. Start the PDF Conversion

Once folders are selected, click the "Start Conversion" button.
The app will process each PDF in the input folder, check if it's non-searchable, and convert it to a searchable PDF if needed.
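
The folder walk with mirrored output paths can be sketched as follows. This is an assumption about how such a step might look, not the code in Textractor.py; `plan_outputs` is a hypothetical helper, and the check for whether a PDF is already searchable is deliberately left out.

```python
from pathlib import Path


def plan_outputs(input_dir, output_dir):
    """Pair every PDF under input_dir with a mirrored path under output_dir."""
    source_root = Path(input_dir)
    target_root = Path(output_dir)
    pairs = []
    for source in sorted(source_root.rglob("*")):
        if source.is_file() and source.suffix.lower() == ".pdf":
            # Keep the path relative to the input root so subfolders are preserved.
            pairs.append((source, target_root / source.relative_to(source_root)))
    return pairs
```

Each returned pair holds the source file and the location where its searchable version would be written, so a file in a subfolder of the input lands in the same subfolder of the output.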

4. Review the Conversion Log

After the conversion process is complete, a conversion log will be generated in the output folder (conversion_log.txt), containing details of the processing results, such as errors or successfully processed files.
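
For illustration, writing such a log could look like the sketch below. The tab-separated layout and the `write_conversion_log` helper are assumptions made for this example; the real conversion_log.txt produced by Textractor may be formatted differently.

```python
from pathlib import Path


def write_conversion_log(output_dir, results):
    """Write one line per processed file to conversion_log.txt and return its path.

    `results` is an iterable of (filename, status, detail) tuples (illustrative layout).
    """
    log_path = Path(output_dir) / "conversion_log.txt"
    lines = [f"{status}\t{filename}\t{detail}" for filename, status, detail in results]
    log_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return log_path
```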

🛠️ Dependencies

📝 Additional Notes

Ensure that Tesseract OCR and Hunspell dictionaries are correctly installed and their paths are configured in the code and system PATH.
The app currently supports OCR for German and English. To add support for additional languages, install the corresponding Tesseract language data and extend the lang argument of the OCR call in the code:
```
raw_text = pytesseract.image_to_string(image, lang="deu+eng")
```
For accurate OCR, make sure the scanned pages in your PDF files are high-resolution images.
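
Both notes can be combined in a small sketch: Tesseract accepts several language codes joined with `+`, and pdf2image lets you choose the rendering resolution through its `dpi` parameter. The helper names `build_lang_argument` and `ocr_pdf_pages`, and the value of 300 dpi, are assumptions for illustration and are not taken from Textractor.py.

```python
def build_lang_argument(languages):
    """Join Tesseract language codes with '+', e.g. ['deu', 'eng'] -> 'deu+eng'."""
    codes = [code.strip() for code in languages if code.strip()]
    if not codes:
        raise ValueError("at least one language code is required")
    return "+".join(codes)


def ocr_pdf_pages(pdf_path, languages=("deu", "eng"), dpi=300):
    """Render each page of pdf_path at `dpi` and return the OCR text per page."""
    # Third-party imports are kept local so build_lang_argument works on its own.
    from pdf2image import convert_from_path
    import pytesseract

    lang = build_lang_argument(languages)
    pages = convert_from_path(pdf_path, dpi=dpi)
    return [pytesseract.image_to_string(page, lang=lang) for page in pages]
```

Adding French, for example, would mean passing `("deu", "eng", "fra")`, provided the French language data is installed for Tesseract.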

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📜 License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Enjoy using Textractor to make your PDF files searchable and error-free with integrated spell-checking!

🙏 Acknowledgments

Thanks to UB Mannheim for providing the Tesseract OCR Windows installer.
