Commit fc0bdc6

Merge pull request #225 from pkjagesia/prerna
Added a script to extract text from PDF
2 parents 30857ee + d34860f commit fc0bdc6

4 files changed

+66
-0
lines changed

AUTOMATION/PDF To Text/README.md

+21
@@ -0,0 +1,21 @@
# Extracting text from PDF using Python

Create a new folder and create a pdfToText.py file in it. Copy the code from pdfToText.py in this repository into that file.

Open the terminal and install pdfminer.six:

```sh
pip install pdfminer.six

```

In the same folder, add the PDF from which you want to extract text (here, the PDF used is test.pdf). Pass this PDF to the script as a command-line argument.

Run the script using:

```sh
python3 pdfToText.py test.pdf

```

The extracted text will be available in converted_pdf.txt.
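Since the README asks the user to pass a PDF on the command line, a small pre-flight check can catch a wrong path or a non-PDF file before extraction starts. The following is a minimal sketch, not part of this commit: `looks_like_pdf` and `PDF_MAGIC` are hypothetical names, and the check assumes the file begins with the usual `%PDF-` header, which may not hold for files with leading junk bytes.

```python
from pathlib import Path

# Assumption: a well-formed PDF starts with these bytes.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(path):
    """Return True if `path` is an existing file that starts with the PDF header."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as fh:
        return fh.read(len(PDF_MAGIC)) == PDF_MAGIC
```

A script could call `looks_like_pdf(args.file)` right after parsing arguments and exit with a clear message when it returns `False`.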
+28
@@ -0,0 +1,28 @@
Adobe Acrobat PDF Files

Adobe® Portable Document Format (PDF) is a universal file format that preserves all
of the fonts, formatting, colours and graphics of any source document, regardless of
the application and platform used to create it.

Adobe PDF is an ideal format for electronic document distribution as it overcomes the
problems commonly encountered with electronic file sharing.

• Anyone, anywhere can open a PDF file. All you need is the free Adobe Acrobat
Reader. Recipients of other file formats sometimes can't open files because they
don't have the applications used to create the documents.

• PDF files always print correctly on any printing device.

• PDF files always display exactly as created, regardless of fonts, software, and
operating systems. Fonts, and graphics are not lost due to platform, software, and
version incompatibilities.

• The free Acrobat Reader is easy to download and can be freely distributed by

anyone.

• Compact PDF files are smaller than their source files and download a

page at a time for fast display on the Web.


AUTOMATION/PDF To Text/pdfToText.py

+17
@@ -0,0 +1,17 @@
    import argparse
    import pdfminer.high_level

    # Extract text with Pdfminer.six Module
    def With_PdfMiner(pdf):
        with open(pdf,'rb') as file_handle_1:
            doc = pdfminer.high_level.extract_text(file_handle_1)

        with open('converted_pdf.txt','w') as file_handle_2 :
            file_handle_2.write(doc)


    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument("file", help = "PDF file from which we extract text")
        args = parser.parse_args()
        With_PdfMiner(args.file)
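
The script above always writes to converted_pdf.txt using the platform's default encoding, so a second run overwrites the first result and characters such as "•" may fail to encode on some systems. A possible variant is sketched below under stated assumptions: `output_path_for` and `convert` are hypothetical names not present in this commit, the output name is derived from the input name, and the `extract` parameter is an illustrative hook that defaults to `pdfminer.high_level.extract_text`.

```python
from pathlib import Path


def output_path_for(pdf_path):
    """Derive the text file name from the PDF name, e.g. test.pdf -> test.txt."""
    return Path(pdf_path).with_suffix(".txt")


def convert(pdf_path, extract=None):
    """Extract text from `pdf_path` and write it next to the PDF as UTF-8.

    `extract` is a hypothetical hook for swapping the extractor;
    by default it is pdfminer.high_level.extract_text.
    """
    if extract is None:
        # Imported lazily so output_path_for works without pdfminer.six installed.
        from pdfminer.high_level import extract_text as extract
    text = extract(str(pdf_path))
    out_path = output_path_for(pdf_path)
    # Explicit UTF-8 so the output is written the same way on every platform.
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

With this sketch, the `__main__` block of the script would call `convert(args.file)` in place of `With_PdfMiner(args.file)`, and test.pdf would produce test.txt rather than converted_pdf.txt.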

AUTOMATION/PDF To Text/test.pdf

7.76 KB
Binary file not shown.
