A set of Jupyter notebooks to scrape some of the most popular web platforms for scientific papers.
- Python environment
- Chrome driver (see the setup sketch below)
- pdf2txt
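
The notebooks presumably drive Chrome through Selenium, so a working chromedriver is needed. A minimal setup sketch, where the headless option and the test URL are illustrations rather than anything taken from the notebooks:

```python
# Minimal Selenium + ChromeDriver setup sketch (options and URL are assumptions,
# not taken from the notebooks).
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run Chrome without opening a visible window

# chromedriver must be installed; on older Selenium versions pass its path explicitly.
driver = webdriver.Chrome(options=options)
driver.get("https://arxiv.org")
print(driver.title)
driver.quit()
```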
- get all search result URLs and their corresponding number of search results
- dump these into scrap_arvix.ipynb to get a clean, duplicate-free list of URLs of individual papers
- get all PDFs (see the sketch after this list)
- scrape the data
- manual quality check
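
A rough sketch of the de-duplication and PDF-download steps, assuming arXiv abstract URLs and the usual `/abs/` to `/pdf/` mapping; the output folder and example URL are made up for illustration:

```python
# Sketch of de-duplicating paper URLs and downloading the PDFs
# (folder name and URL pattern are assumptions; the notebook may differ).
import os
import requests

def dedupe(urls):
    """Remove duplicate URLs while keeping the original order."""
    seen, out = set(), []
    for u in urls:
        if u not in seen:
            seen.add(u)
            out.append(u)
    return out

def download_pdf(abs_url, out_dir="pdfs"):
    """Map an arXiv abstract URL to its PDF URL and save the file."""
    os.makedirs(out_dir, exist_ok=True)
    pdf_url = abs_url.replace("/abs/", "/pdf/")
    name = pdf_url.rstrip("/").split("/")[-1] + ".pdf"
    resp = requests.get(pdf_url, timeout=30)
    resp.raise_for_status()
    with open(os.path.join(out_dir, name), "wb") as f:
        f.write(resp.content)

for url in dedupe(["https://arxiv.org/abs/1706.03762",
                   "https://arxiv.org/abs/1706.03762"]):
    download_pdf(url)
```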
- get all search result URLs and their corresponding number of search results
- dump these into scrap_biorxiv.ipynb to get a clean, duplicate-free list of URLs of individual papers
- now scrape; this step also downloads the PDFs (see the sketch below)
- manual quality check
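
For illustration, a sketch of scraping one bioRxiv paper page and grabbing its PDF in the same pass. The notebooks likely drive Chrome through Selenium; `requests`/`BeautifulSoup` are used here only to keep the sketch short, and the `.full.pdf` suffix is an assumption:

```python
# Sketch only: scrape a bioRxiv abstract page and fetch its PDF in one pass.
import requests
from bs4 import BeautifulSoup

def scrape_biorxiv(paper_url):
    page = requests.get(paper_url, timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")

    h1 = soup.find("h1")
    title = h1.get_text(strip=True) if h1 else None

    # bioRxiv usually serves the PDF at "<paper URL>.full.pdf" (assumption).
    pdf = requests.get(paper_url.rstrip("/") + ".full.pdf", timeout=30)

    return {
        "url": paper_url,
        "title": title,
        "pdf_bytes": pdf.content if pdf.ok else None,
    }
```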
- search PubMed and download the results as .csv files into the raw_result folder
- use scrap_pubmed.ipynb to combine all the CSVs, remove duplicates, and scrape the results (no PDFs); see the sketch below
- save as .csv and do a manual quality check of the search results
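
The combine-and-deduplicate step might look roughly like this with pandas; the `raw_result` folder comes from the step above, while the output file name is arbitrary:

```python
# Sketch of merging the exported PubMed CSVs and dropping duplicate rows.
import glob
import pandas as pd

frames = [pd.read_csv(path) for path in glob.glob("raw_result/*.csv")]
combined = pd.concat(frames, ignore_index=True)

# Drop exact duplicate rows; a subset such as the title column could be used
# instead, depending on the PubMed export format.
combined = combined.drop_duplicates()
combined.to_csv("pubmed.csv", index=False)
```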
- get the table-of-contents PDFs from Springer
- 2014 and 2015: get the URLs manually
- 2016 and 2017: the PDFs contain the paper URLs
- run getMiccaiUrls.py to extract the URLs from the PDFs and dump them as a list into a .npy file (see the sketch below)
- read these in scrap_miccai.ipynb and add the hardcoded 2014 and 2015 URLs
- run scrap_miccai.ipynb (no PDFs)
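
A sketch of what getMiccaiUrls.py presumably does: extract URLs from the table-of-contents PDFs with pdfminer and save them as a .npy list. The input folder name, library choice, and URL regex are assumptions:

```python
# Sketch: pull URLs out of the table-of-contents PDFs and store them as .npy.
import re
import glob
import numpy as np
from pdfminer.high_level import extract_text  # pdfminer.six

urls = []
for pdf_path in glob.glob("miccai_toc/*.pdf"):  # folder name is an assumption
    text = extract_text(pdf_path)
    # crude URL pattern; refine if links wrap across lines in the PDF
    urls += re.findall(r"https?://\S+", text)

np.save("miccai_urls.npy", np.array(urls))
# later, in the notebook: urls = np.load("miccai_urls.npy").tolist()
```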
- go to http://ieeexplore.ieee.org/Xplore/home.jsp
- enter keywords and download the .csv; the link to the search will be in the first row
- combine and clean the multiple downloaded CSVs using combine_Ieee.ipynb; this produces a single, duplicate-free ieee.csv
- run scrap_ieee.ipynb: first download as many PDFs as possible, then loop through the CSV and run pdf2txt on each PDF to extract e-mail addresses (see the sketch below)
- manual cleanup
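
The PDF-to-text and e-mail extraction step could be sketched as below. The notebook uses pdf2txt; here pdfminer's Python API is used instead to keep the example self-contained, and the folder name and regex are assumptions:

```python
# Sketch of the pdf-to-text + e-mail extraction step.
import re
import glob
from pdfminer.high_level import extract_text

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def emails_from_pdf(pdf_path):
    """Convert one PDF to text and return the unique e-mail addresses found."""
    text = extract_text(pdf_path)
    return sorted(set(EMAIL_RE.findall(text)))

for path in glob.glob("ieee_pdfs/*.pdf"):  # folder name is an assumption
    print(path, emails_from_pdf(path))
```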