PDF information extractor

The aim of this project is to extract information from a scientific article (PDF format) and put it in an Excel file.

The data is then transferred to a Neo4j database.

The second part of the project is to find the main topics of a submitted abstract.

The project is divided into three main parts:

  • info-extractor-app, an app where you can extract information from a PDF or fill in entries in the database
  • model-app, an app written with Streamlit, dedicated to designing the models
  • article-app, an app exploiting the models directly

Note: this project was designed to support several databases, but due to time constraints, only SQLite is currently supported.

Getting started

Conda (recommended)

conda env create -f environment.yml
conda activate pdf-extraction-env

Pip

Pip was at version 20.2.3 when this project was created.

pip install -r requirements.txt

Docker

For the article app (note: it might not work because of the database inclusion):

cd article-app
docker build -t article-app .
docker run -d --name app-demo -p 5000:5000 article-app

# Stop the container
docker stop app-demo
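
Since the container may not start correctly, it can help to check that something is actually listening on the published port before opening the app. The following is a minimal sketch using only the standard library; the function name `wait_for_port` and its defaults are illustrative assumptions and are not part of this project. The port 5000 comes from the `docker run` command above.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=5000, timeout=10.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper: the default port matches `-p 5000:5000` above.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means a process is accepting connections.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait a little and retry.
            time.sleep(interval)
    return False
```

Calling `wait_for_port()` right after `docker run` returns `False` if nothing answers within ten seconds, in which case `docker logs app-demo` is a reasonable next step.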

docker-compose.yml coming soon!

Database

If you're using a SQL database, please run the following command:

cd info-extractor-app
python -c "from server_module import db, app; app.app_context().push(); db.create_all()"
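
The same initialization can also be kept in a small script rather than a one-liner. This is a sketch only: the helper name `init_db` is an assumption, while `server_module`, `app` and `db` are the names used in the command above. It assumes `app` exposes Flask's `app_context()` and `db` exposes `create_all()`, as that command implies.

```python
def init_db(app, db):
    """Create every table declared on `db` inside the application context of `app`."""
    # create_all() needs an active application context to find the database
    # configuration, which is why it is wrapped in `with app.app_context()`.
    with app.app_context():
        db.create_all()


# Usage from inside info-extractor-app (names taken from the README command):
#   from server_module import app, db
#   init_db(app, db)
```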

Launch the app

Article app

cd article-app
export FLASK_APP=server.py
python server.py

Model app

cd model-app
streamlit run stapp.py 

Extractor app

cd info-extractor-app
python -c "from server_module import db, app; app.app_context().push(); db.create_all()"
export FLASK_APP=server.py
python server.py

Troubleshooting

PDF extraction might not be the best way to get some information, such as the ID; the main API would be more useful for that. In addition, PyPDF2 can have trouble ordering the extracted data properly.
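
As an illustration of why the ID is fragile to extract from PDF text, here is a minimal sketch that searches already-extracted text for an identifier. It assumes the ID is a DOI, which this README does not state; the pattern and the function name `find_doi` are illustrative and not part of the project. When nothing matches it returns `None`, so the caller can fall back to another source such as an API.

```python
import re

# Assumption: the identifier is a DOI such as "10.1000/xyz123".
# The pattern is a common heuristic, not an exhaustive DOI grammar.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_doi(text):
    """Return the first DOI-like string found in `text`, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Strip punctuation that usually belongs to the sentence, not to the DOI.
    return match.group(0).rstrip(".,;)")
```

For example, `find_doi("See https://doi.org/10.1000/xyz123.")` returns `"10.1000/xyz123"`, while text in which the identifier was split or reordered during extraction yields `None`.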