extract-pdf-notes


We often read technical and non-technical books and do a lot of highlighting in them, sometimes adding notes such as the meaning of a new word or our opinion about the text. I was searching for a way to get those notes out of the PDF, and there are many options:

  • Sumnotes.net (good service, but you have to pay to extract more notes)
  • existing PDF tools like Adobe Reader (also good, but you have to pre-configure it to add highlighted text to the comments before you even start highlighting)
  • research assistants like Zotero (you can't move the notes out if you want to store them somewhere else)
  • and many more...

But each had its own limitations, and I had a specific requirement: getting the reference images along with the text. I read PDFs on Android, Linux, Windows, etc., so I definitely use different apps on each (I could use Foxit on all of them, but there are better options on Android for making notes easily). I therefore needed all the information stored in the PDF to be extractable without paying a large amount to some service.

A Stack Overflow user asked a question for this same purpose. Refer to - https://stackoverflow.com/questions/21050551/extracting-text-from-higlighted-text-using-poppler-qt4-python-poppler-qt4

I've just updated the script with additional code to extract highlighted images (with Foxit, draw a box around the image and keep the border thin) and made it directly usable without installing poppler-qt5 on your system. So just use it whenever needed and throw it away when done, without leaving any footprint (libraries or dependencies that you don't otherwise need) on your system.

There is already a gist available; this repo just makes it easier to use through a Docker container or a Vagrant config, so it can be used without installing its dependencies.

You can execute the script in the following ways:

  • Install the dependencies on your system - the major requirements are python3-poppler-qt5 and PyQt5 (tested on Ubuntu 18.04 Bionic)
  • Use the docker image
  • Create your own VM from the Vagrantfile and then use the script

Using online docker image

This method uses my latest docker image from Docker Hub. It is the easiest of the available methods.

Requirements

  • You need Docker installed and running on the system (link to install provided above)

Usage

  • Navigate to the directory with the highlighted pdf.
  • With docker commands (assuming that you have a PDF named Sample Book.pdf in the current directory):
    • Start the container from the image as follows

      docker run -v "${PWD}":/notes --user $(id -u):$(id -g) vsukt/extract_pdf_notes:latest "Sample Book.pdf"
      
    • This will mount the current directory inside the container as /notes (which is the working directory of the container)

    • The name of the PDF is passed as an argument to the script, which will start printing the highlighted text on the command line - Text annotations/Typewriter/Comments, Highlights/Underline/Strikeout, etc.

    • It will also print the name of the image file as part of the output text and will create PNG files for any Geometric annotations

      This can be useful to extract image files from the PDF - you can then insert these images wherever you keep your final notes.

      NOTE: Current resolution is 150p. If you want to change it, use the local docker image method described below

    • You can save the text output by redirecting it to another file e.g.

      docker run -v "${PWD}":/notes --user $(id -u):$(id -g) vsukt/extract_pdf_notes:latest "Sample Book.pdf" >"Sample Book.txt"
      
  • or just grab extract_notes.sh and execute it with the PDF file name as the argument.

    mkdir -p ~/.local/bin
    curl -sSL https://raw.githubusercontent.com/v-sukt/extract-pdf-notes/master/extract_notes.sh -o ~/.local/bin/extract_notes && chmod a+x ~/.local/bin/extract_notes
    echo $PATH | grep ~/.local/bin > /dev/null || echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc
    extract_notes "Sample File.pdf"

    It'll create a local directory named filename_Notes and keep all your notes in there. If you get bash: curl: command not found, you'll have to install curl or use wget instead of curl.
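
If you would rather drive the docker run command shown above from a script than type it each time, it can be wrapped in a few lines of Python. This is only a sketch: the names build_docker_command and extract_to_file are hypothetical and not part of this repo, it assumes docker is on your PATH, and it assumes a POSIX system (it uses os.getuid and os.getgid for the --user value).

```python
import os
import subprocess
from pathlib import Path

# Image name as used in this README.
IMAGE = "vsukt/extract_pdf_notes:latest"


def build_docker_command(pdf_name, workdir, uid, gid, image=IMAGE):
    """Return an argv list mirroring the docker run line from the usage steps."""
    return [
        "docker", "run",
        "-v", f"{workdir}:/notes",   # mount the PDF directory as /notes
        "--user", f"{uid}:{gid}",    # run the container as this uid:gid
        image,
        pdf_name,                    # passed through to the script as its argument
    ]


def extract_to_file(pdf_name, workdir="."):
    """Hypothetical helper: run the container and save its stdout next to the PDF."""
    workdir = Path(workdir).resolve()
    out_path = workdir / (Path(pdf_name).stem + ".txt")
    cmd = build_docker_command(pdf_name, workdir, os.getuid(), os.getgid())
    with open(out_path, "w", encoding="utf-8") as out:
        # check=True raises CalledProcessError if docker exits with an error
        subprocess.run(cmd, stdout=out, check=True)
    return out_path
```

Calling extract_to_file("Sample Book.pdf") from the directory that holds the PDF should leave a Sample Book.txt beside it, the same as the redirect shown earlier. Passing the command as a list means the space in the file name needs no extra quoting.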

Using local docker image

Requirements:

Building the local image

Usage

  • Navigate to the directory with the highlighted pdf
  • Assuming that you've a pdf Sample Book.pdf in the current directory
    • Start the container from image as follows

       docker run -v "${PWD}":/notes extract_notes:0.6 "Sample Book.pdf"
      
    • This will mount the current directory inside the container as /notes (which is the working directory of the container)

    • The name of the PDF is passed as an argument to the script, which will start printing the highlighted text on the command line

      • Text annotations/Typewriter/Comments
      • Highlights/Underline/Strikeout, etc.
    • It will also print the name of the image file as part of the output text and will create a PNG file for any Geometric annotations

      This can be useful to extract image files from the PDF

      NOTE: Current resolution is 150p. If you want to change it, change the variable resolution in the extract_pdf_notes.py file.

    • You can save the text output by redirecting it to another file e.g.

      docker run -v "${PWD}":/notes extract_notes:0.6 "Sample Book.pdf" >"Sample Book.txt"
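
To get a feel for what the resolution setting does to the exported PNGs, here is a small self-contained sketch. It is not taken from extract_pdf_notes.py. It assumes that an annotation rectangle is given as fractions of the page (which, as far as I know, is how poppler-qt reports annotation boundaries), that the page size is in PDF points (1/72 inch), and that the resolution value means dots per inch. The function names are hypothetical.

```python
POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def to_page_points(norm_rect, page_width, page_height):
    """Scale a (left, top, right, bottom) rectangle of page fractions to points."""
    left, top, right, bottom = norm_rect
    return (left * page_width, top * page_height,
            right * page_width, bottom * page_height)


def to_pixel_box(points_rect, resolution=150):
    """Convert a rectangle in points to an (x, y, width, height) box in pixels."""
    scale = resolution / POINTS_PER_INCH
    left, top, right, bottom = points_rect
    return (round(left * scale), round(top * scale),
            round((right - left) * scale), round((bottom - top) * scale))


# Example: a box drawn on a 720 x 960 point page (assumed values).
rect_pts = to_page_points((0.25, 0.125, 0.75, 0.5), 720, 960)
print(to_pixel_box(rect_pts, resolution=150))  # (375, 250, 750, 750)
print(to_pixel_box(rect_pts, resolution=300))  # (750, 500, 1500, 1500)
```

Under these assumptions, doubling the resolution doubles both pixel dimensions of the cropped image, so the PNG holds about four times as many pixels.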
      

Using local VM (Vagrant+Virtualbox)

This method uses the Vagrantfile to create a VM for you. It also mounts the current directory inside the VM at /vagrant, where you can extract your notes by placing the file in it.

Requirements

Starting the VM

Usage

  • Copy the highlighted pdf to the repo directory
  • Assuming that you've a pdf Sample Book.pdf in the current directory
    • Get inside the VM with

       vagrant ssh
      
    • The VM has mounted the current directory at /vagrant - so anything that happens inside will be reflected in the repo directory

    • To execute the script use

      python3 extract_pdf_notes.py 'Sample Book.pdf'
      
    • The name of the PDF is passed as an argument to the script, which will start printing the highlighted text on the command line

      • Text annotations/Typewriter/Comments
      • Highlights/Underline/Strikeout, etc.
    • It will also print the name of the image file as part of the output text and will create a PNG file for any Geometric annotations

      This can be useful to extract image files from the PDF

      NOTE: Current resolution is 150p. If you want to change it, change the variable resolution in the extract_pdf_notes.py file.

    • You can save the text output by redirecting it to another file e.g.

      python3 extract_pdf_notes.py "Sample Book.pdf" >"Sample Book.txt"
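
If you have several highlighted PDFs in the directory, the per-file command above can be looped. A minimal sketch, assuming extract_pdf_notes.py sits in the directory you run it from and prints the notes to standard output as described; the helper names notes_path, find_pdfs and extract_all are hypothetical and not part of the repo.

```python
import subprocess
import sys
from pathlib import Path


def notes_path(pdf_path):
    """Return the .txt path that sits next to the given PDF."""
    return Path(pdf_path).with_suffix(".txt")


def find_pdfs(directory="."):
    """List the PDF files in a directory, sorted by name."""
    return sorted(p for p in Path(directory).iterdir()
                  if p.is_file() and p.suffix.lower() == ".pdf")


def extract_all(directory=".", script="extract_pdf_notes.py"):
    """Run the script once per PDF and save each run's stdout to a .txt file."""
    written = []
    for pdf in find_pdfs(directory):
        target = notes_path(pdf)
        with open(target, "w", encoding="utf-8") as out:
            # sys.executable is the Python running this helper (python3 in the VM)
            subprocess.run([sys.executable, script, str(pdf)],
                           stdout=out, check=True)
        written.append(target)
    return written
```

Calling extract_all() writes one text file per PDF, for example Sample Book.txt for Sample Book.pdf, and stops at the first file for which the script exits with an error.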
      

Hope this was useful!