Textractor is a Python application that converts non-searchable PDFs into searchable PDFs using Optical Character Recognition (OCR) with support for both German and English text. It integrates spell-checking for extracted text using dictionaries and allows batch processing of PDF files through a user-friendly graphical interface (GUI).
Have you ever needed to:
- Copy text from a PDF file?
- Search for PDF files containing specific words?
- Organize your documents by their content?
Textractor solves these problems by scanning the PDF files in a specific path, extracting their text, and spell-checking it before storing new searchable versions of the PDF files. Don't worry, PDF files that already contain selectable text are skipped!
📄 Converts non-searchable PDFs into searchable ones.
🔍 Integrates OCR to extract text from PDF images.
📝 Spell-checks the extracted text using German and English dictionaries.
🖼️ Supports batch processing of PDFs within folders.
🔧 Easy-to-use GUI for folder selection and processing.
📂 Allows selecting custom output directories for processed files.
🗃️ Respects your folder structure by mirroring the source folder's layout in the output folder.
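The folder mirroring mentioned in the feature list can be pictured with a small sketch. It is not taken from the Textractor source: the function names `mirrored_output_path` and `prepare_output_path` are hypothetical, and the sketch only assumes that each PDF keeps its path relative to the input folder when it is written to the output folder.

```python
from pathlib import Path


def mirrored_output_path(pdf_path, input_root, output_root):
    """Return where a processed copy of pdf_path would live under output_root."""
    # Path of the PDF relative to the input folder, e.g. invoices/2023/a.pdf
    relative = Path(pdf_path).relative_to(input_root)
    return Path(output_root) / relative


def prepare_output_path(pdf_path, input_root, output_root):
    """Compute the mirrored path and create its parent folders if missing."""
    target = mirrored_output_path(pdf_path, input_root, output_root)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target
```

With this approach a file at `in/a/b/x.pdf` ends up at `out/a/b/x.pdf`, so the output folder can be browsed the same way as the source.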
1. Install Tesseract OCR
A- For Windows: Download and install the latest release of the Tesseract OCR Windows installer from UB Mannheim here.
B- For macOS: If you have Homebrew installed, you can install Tesseract by running the following command:
brew install tesseract
C- For Linux (Ubuntu/Debian): Install Tesseract via APT
sudo apt-get install tesseract-ocr
For installation on other operating systems, refer to the original download page of Tesseract OCR.
During installation, note the installation path. You will need to update the path in the code.
Verify the installation by running tesseract -v
2. Update the Path in Code
In the Textractor code, update the following line with the correct path to the Tesseract-OCR executable (make sure the path is correct):
pytesseract.pytesseract.tesseract_cmd = r'C:\Path\To\Tesseract-OCR\tesseract.exe'
On other operating systems (not Windows):
pytesseract.pytesseract.tesseract_cmd = 'Path/To/Tesseract-OCR/executable'
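If the same copy of the script is used on several machines, the path can be chosen at runtime instead of editing the line by hand. The sketch below is only a suggestion and not part of Textractor: `pick_tesseract_cmd` and `configure_pytesseract` are hypothetical helper names, and the two path arguments are whatever paths you noted during installation.

```python
import sys


def pick_tesseract_cmd(windows_path, other_path, platform=sys.platform):
    """Return the Tesseract executable path that matches the platform."""
    # sys.platform is "win32" on Windows, "darwin" on macOS, "linux" on Linux
    return windows_path if platform.startswith("win") else other_path


def configure_pytesseract(windows_path, other_path):
    """Point pytesseract at the chosen executable (needs pytesseract installed)."""
    import pytesseract  # imported lazily so the helper above works without it

    pytesseract.pytesseract.tesseract_cmd = pick_tesseract_cmd(windows_path, other_path)
```

Calling `configure_pytesseract(...)` once at startup, before any OCR call, would have the same effect as the hard-coded assignment.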
3. Download and Install Poppler
A- For Windows:
Download Poppler binaries from Owen Schwartz here.
Add the poppler\bin folder to your PATH environment variable.
B- For macOS:
If you have Homebrew installed, you can install Poppler by running the following command:
brew install poppler
C- For Linux (Ubuntu/Debian):
sudo apt-get install poppler-utils
Verify the installation by running pdftoppm -v
4. Install Required Python Packages
Run the following command to install all necessary dependencies:
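pip install pdf2image pytesseract pyenchant
tkinter is part of the Python standard library and cannot be installed with pip. On Debian/Ubuntu it may have to be added separately with sudo apt-get install python3-tk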
5. Hunspell Dictionaries Setup
To ensure the spell-check functionality works:
A. Install Hunspell Dictionaries: Download and install Hunspell dictionaries for German (de_DE) and English (en_US) languages. You can find the dictionaries here.
B. Update the Dictionary Path in the code: Update the path to the dictionaries in the following line:
enchant.set_param("enchant.myspell.dictionary.path", r'C:\Path\To\Hunspell\Dictionaries')
On other operating systems (not Windows):
enchant.set_param("enchant.myspell.dictionary.path", 'Path/To/Hunspell/Dictionaries/')
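Once the dictionaries are found, a two-language spell check boils down to asking each dictionary in turn. How Textractor applies the check internally is not described here, so the following is a sketch under assumptions: the helper names are hypothetical, and a word counts as correct when at least one dictionary accepts it. `enchant.Dict(code).check(word)` is the pyenchant call the helpers rely on.

```python
def is_known_word(word, dictionaries):
    """Return True if at least one dictionary accepts the word."""
    return any(d.check(word) for d in dictionaries)


def unknown_words(text, dictionaries):
    """List the alphabetic words in text that no dictionary recognizes."""
    # Strip surrounding punctuation, skip tokens containing digits or symbols
    words = (token.strip(".,;:!?()[]\"'") for token in text.split())
    return [w for w in words if w.isalpha() and not is_known_word(w, dictionaries)]


def load_dictionaries(codes=("de_DE", "en_US")):
    """Open one enchant dictionary per language code (needs pyenchant installed)."""
    import enchant  # imported lazily so the helpers above work with any checker

    return [enchant.Dict(code) for code in codes]
```

For example, `unknown_words(raw_text, load_dictionaries())` would list the words in the OCR output that neither the German nor the English dictionary knows.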
6. Update Environment PATH Variables
Ensure that Tesseract OCR, Hunspell, and Poppler paths are included in your system's PATH environment variables to allow the software to locate them during execution.
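A quick way to confirm that the executables are reachable through PATH is to look them up from Python before launching the app. This is a minimal sketch using only the standard library; `find_tools` and `missing_tools` are hypothetical names, and the check covers the two command-line tools named in the verification steps (`tesseract` and `pdftoppm`).

```python
import shutil


def find_tools(names=("tesseract", "pdftoppm")):
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names=("tesseract", "pdftoppm")):
    """Return the executable names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]
```

If `missing_tools()` returns an empty list, both tools can be located; otherwise the returned names tell you which PATH entry still needs to be added.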
1. Launch the Application
Simply run the Python script using the following command:
python Textractor.py
2. Select Input and Output Folders
A GUI will appear where you can select the folder containing the PDF files to process.
Use the "Browse" button to select both Input and Output folders:
Input Folder: Contains the non-searchable PDF files.
Output Folder: Where the processed searchable PDFs will be saved.
3. Start the PDF Conversion
Once folders are selected, click the "Start Conversion" button.
The app will process each PDF in the input folder, check if it's non-searchable, and convert it to a searchable PDF if needed.
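The processing step can be read as a simple loop: find every PDF, skip the ones that are already searchable, convert the rest. The sketch below illustrates that flow only. It is not the Textractor implementation: `find_pdfs` and `convert_all` are hypothetical names, and `is_searchable` and `convert` are placeholder callables standing in for the app's own detection and OCR logic.

```python
from pathlib import Path


def find_pdfs(input_root):
    """Return all PDF files below input_root, including subfolders."""
    # Compare the suffix in lower case so that .PDF files are found too
    return sorted(
        p for p in Path(input_root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def convert_all(pdfs, is_searchable, convert):
    """Convert every non-searchable PDF and report a status per file.

    is_searchable(path) -> bool and convert(path) -> None are supplied by
    the caller; both are hypothetical stand-ins for the real logic.
    """
    results = []
    for pdf in pdfs:
        if is_searchable(pdf):
            results.append((pdf, "skipped"))
            continue
        try:
            convert(pdf)
            results.append((pdf, "converted"))
        except Exception as exc:  # one bad file should not stop the batch
            results.append((pdf, f"error: {exc}"))
    return results
```

Catching the exception per file is a design choice of this sketch: a single damaged PDF then shows up as an error entry instead of aborting the whole batch.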
4. Review the Conversion Log
- After the conversion process is complete, a conversion log will be generated in the output folder (conversion_log.txt), containing details of the processing results, such as errors or successfully processed files.
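To give an idea of what such a log could contain, here is a small sketch that writes one line per file. The actual layout of conversion_log.txt is not documented here, so the "file: status" format and the helper names `format_log` and `write_log` are assumptions made for illustration.

```python
from pathlib import Path


def format_log(results):
    """Turn (file, status) pairs into the text of a conversion log."""
    lines = [f"{file}: {status}" for file, status in results]
    return ("\n".join(lines) + "\n") if lines else ""


def write_log(output_root, results, name="conversion_log.txt"):
    """Write the log into the output folder and return its path."""
    log_path = Path(output_root) / name
    log_path.write_text(format_log(results), encoding="utf-8")
    return log_path
```

Writing the file as UTF-8 keeps file names with umlauts or other non-ASCII characters readable in the log.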
Ensure that Tesseract OCR and Hunspell dictionaries are correctly installed and their paths are configured in the code and system PATH.
- The app currently supports OCR for both German and English languages. To add support for additional languages, update the OCR command in the code:
raw_text = pytesseract.image_to_string(image, lang="deu+eng")
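Tesseract accepts several languages at once when their codes are joined with a plus sign, which is what "deu+eng" does. The helper below is a hypothetical convenience, not part of Textractor, that builds this value from a list of codes. Each code only works if the matching Tesseract language data is installed; "fra" (French) in the usage note is just an example.

```python
def build_lang(codes):
    """Join Tesseract language codes into the value expected by lang=."""
    # Drop blanks and duplicates while keeping the original order
    unique = list(dict.fromkeys(code.strip() for code in codes if code.strip()))
    if not unique:
        raise ValueError("at least one language code is required")
    return "+".join(unique)
```

For instance, `build_lang(["deu", "eng", "fra"])` produces a value that could be passed as `lang=` to `pytesseract.image_to_string`.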
For accurate OCR, make sure the page images in your PDF files are high-resolution.
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
Enjoy using Textractor to make your PDF files searchable and error-free with integrated spell-checking!
- Thanks to UB Mannheim for providing the Tesseract OCR Windows installer.