A Python parser for scientific PDFs based on GROBID.
Use pip to install from this GitHub repository:
pip install git+https://github.com/titipata/scipdf_parser
Note
- We also need the en_core_web_sm model for spaCy; run python -m spacy download en_core_web_sm to download it.
- You can change the GROBID version in serve_grobid.sh to test the parser on a new GROBID version.
Run GROBID using the given bash script before parsing a PDF.
NOTE: The recommended way to run GROBID is via Docker, so make sure Docker is running on your machine. Update the script so that it uses the latest GROBID version; new releases generally bring substantial improvements.
bash serve_grobid.sh
This script runs GROBID on the default port, 8070.
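Before parsing, it can help to confirm that the GROBID server is actually reachable. The following is a minimal sketch and not part of scipdf: it assumes GROBID's standard /api/isalive health endpoint and the default port 8070 mentioned above, and the helper name grobid_is_alive is hypothetical.

```python
import urllib.request


def grobid_is_alive(base_url="http://localhost:8070", timeout=5.0):
    """Return True if a GROBID server answers at base_url (hypothetical helper).

    Assumes the server exposes the /api/isalive health endpoint.
    """
    url = base_url.rstrip("/") + "/api/isalive"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, refused connections, and timeouts
        # are all subclasses of OSError.
        return False
```

Calling grobid_is_alive() right after bash serve_grobid.sh gives a quick yes/no answer before any PDF is sent to the parser; the server may need some time to start, so a False result shortly after launch is not necessarily an error.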
To parse a PDF from the example_data folder or from a direct URL, use the following function:
import scipdf
article_dict = scipdf.parse_pdf_to_dict('example_data/futoma2017improved.pdf') # returns a dictionary
# you can also parse directly from a URL to a PDF; if as_list is set to True, the 'text' of each parsed section is a list of paragraphs instead of a single string
article_dict = scipdf.parse_pdf_to_dict('https://www.biorxiv.org/content/biorxiv/early/2018/11/20/463760.full.pdf', as_list=False)
# output example
>> {
'title': 'Proceedings of Machine Learning for Healthcare',
'abstract': '...',
'sections': [
{'heading': '...', 'text': '...'},
{'heading': '...', 'text': '...'},
...
],
'references': [
{'title': '...', 'year': '...', 'journal': '...', 'author': '...'},
...
],
'figures': [
{'figure_label': '...', 'figure_type': '...', 'figure_id': '...', 'figure_caption': '...', 'figure_data': '...'},
...
],
'doi': '...'
}
xml = scipdf.parse_pdf('example_data/futoma2017improved.pdf', soup=True) # option to parse full XML from GROBID
To parse figures from PDF using pdffigures2, you can run
scipdf.parse_figures('example_data', output_folder='figures') # folder should contain only PDF files
You can see example output figures in the figures folder.
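The dictionary returned by parse_pdf_to_dict can be post-processed with plain Python. As a rough sketch, assuming only the keys shown in the output example above (title, sections, references), the hypothetical helper summarize_article below counts words per section and accepts text either as a single string or, when as_list=True, as a list of paragraphs. The sample dictionary is made up for illustration and is not real parser output.

```python
def section_text(section):
    """Return a section's text as one string, whether it is a str or a list of paragraphs."""
    text = section.get("text", "")
    if isinstance(text, list):
        return "\n".join(text)
    return text


def summarize_article(article_dict):
    """Build a small summary from a parse_pdf_to_dict-style dictionary (hypothetical helper)."""
    sections = article_dict.get("sections", [])
    return {
        "title": article_dict.get("title", ""),
        "n_references": len(article_dict.get("references", [])),
        # Note: sections that share a heading overwrite each other in this mapping.
        "section_word_counts": {
            s.get("heading", ""): len(section_text(s).split()) for s in sections
        },
    }


# Made-up input that mimics the documented structure (not real parser output)
example = {
    "title": "An Example Paper",
    "sections": [
        {"heading": "Introduction", "text": "This is the introduction."},
        {"heading": "Methods", "text": ["First paragraph here.", "Second one."]},
    ],
    "references": [{"title": "...", "year": "...", "journal": "...", "author": "..."}],
}
summary = summarize_article(example)
print(summary)
```

With real output, pass article_dict in place of example. Using .get with defaults keeps the helper from failing if a key is missing from a particular parse.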