Merge pull request #225 from pkjagesia/prerna

larymak · web-flow · commit fc0bdc6d0b66 · 2022-10-25T07:48:50.000+03:00
Added a script to extract text from PDF
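
The script in this change calls pdfminer.six's `extract_text` and always writes to a fixed file, `converted_pdf.txt`. Below is a minimal sketch of the same read-then-write flow with a caller-chosen output path. The names `convert_pdf`, `out_path`, and `extract` are hypothetical and not part of this commit; the extractor is passed in as a callable so the sketch runs even without pdfminer.six installed.

```python
from pathlib import Path
from typing import Callable


def convert_pdf(pdf_path: str, out_path: str, extract: Callable[[str], str]) -> int:
    """Run `extract` on `pdf_path` and write the text to `out_path` as UTF-8.

    `extract` is any callable that takes a PDF path and returns its text
    (hypothetical parameter, used here to keep the sketch dependency-free).
    Returns the number of characters written.
    """
    text = extract(pdf_path)
    # Write UTF-8 explicitly so symbols such as '®' and '•' do not depend
    # on the platform's default encoding.
    Path(out_path).write_text(text, encoding="utf-8")
    return len(text)
```

Assuming pdfminer.six is installed, passing `pdfminer.high_level.extract_text` as `extract` (for example `convert_pdf("test.pdf", "out.txt", pdfminer.high_level.extract_text)`) should reproduce what the script does, with the output path made explicit.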
diff --git a/AUTOMATION/PDF To Text/README.md b/AUTOMATION/PDF To Text/README.md
@@ -0,0 +1,21 @@
+# Extracting text from a PDF using Python
+
+Create a new folder and create a `pdfToText.py` file in it. Copy the code from `pdfToText.py` in this repository into that file.
+
+Open a terminal and install pdfminer.six:
+
+```sh
+pip install pdfminer.six
+
+```
+
+In the same folder, add the PDF you want to extract text from (this example uses test.pdf). The script takes the PDF's path as its command-line argument.
+
+Run the script using:
+
+```sh
+python3 pdfToText.py test.pdf
+
+```
+
+The extracted text is written to converted_pdf.txt in the current working directory, overwriting any existing file with that name.
diff --git a/AUTOMATION/PDF To Text/converted_pdf.txt b/AUTOMATION/PDF To Text/converted_pdf.txt
@@ -0,0 +1,28 @@
+Adobe Acrobat PDF Files
+
+Adobe® Portable Document Format (PDF) is a universal file format that preserves all
+of the fonts, formatting, colours and graphics  of any  source document,  regardless of
+the application and platform used to create it.
+
+Adobe PDF is an ideal format for electronic document distribution as it overcomes the
+problems commonly encountered with electronic file sharing.
+
+•  Anyone, anywhere can open a PDF file. All you need is the free Adobe Acrobat
+Reader.  Recipients  of  other  file  formats  sometimes  can't  open  files  because  they
+don't have the applications used to create the documents.
+
+•  PDF files always print correctly on any printing device.
+
+•  PDF  files  always  display  exactly  as  created,  regardless  of  fonts,  software,  and
+operating systems. Fonts, and graphics are not lost due to platform, software, and
+version incompatibilities.
+
+•  The  free  Acrobat  Reader  is  easy  to  download  and  can  be  freely  distributed  by
+
+anyone.
+
+•  Compact  PDF  files  are  smaller  than  their  source  files  and  download  a
+
+page at a time for fast display on the Web.
+
+
diff --git a/AUTOMATION/PDF To Text/pdfToText.py b/AUTOMATION/PDF To Text/pdfToText.py
@@ -0,0 +1,17 @@
+import argparse
+import pdfminer.high_level
+
+# Extract text with pdfminer.six and save it to converted_pdf.txt (UTF-8, so the result does not depend on the platform default encoding)
+def With_PdfMiner(pdf):
+	with open(pdf, 'rb') as file_handle_1:
+		doc = pdfminer.high_level.extract_text(file_handle_1)
+
+	with open('converted_pdf.txt', 'w', encoding='utf-8') as file_handle_2:
+		file_handle_2.write(doc)
+
+
+if __name__ == '__main__':
+	parser = argparse.ArgumentParser()
+	parser.add_argument("file", help="PDF file from which we extract text")
+	args = parser.parse_args()
+	With_PdfMiner(args.file)
diff --git a/AUTOMATION/PDF To Text/test.pdf b/AUTOMATION/PDF To Text/test.pdf