A set of Jupyter notebooks to scrape some of the most popular web platforms for scientific papers.
- Python environment
- Chrome driver (see the setup sketch below)
- pdf2txt
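
The notebooks presumably drive Chrome through Selenium, so a working chromedriver is needed. A minimal setup sketch, where the headless option and the test URL are illustrations rather than anything taken from the notebooks:

```python
# Minimal Selenium + ChromeDriver setup sketch (options and URL are assumptions,
# not taken from the notebooks).
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run Chrome without opening a visible window

# chromedriver must be installed; on older Selenium versions pass its path explicitly.
driver = webdriver.Chrome(options=options)
driver.get("https://arxiv.org")
print(driver.title)
driver.quit()
```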
- get all search result URLs and their corresponding number of search results
- dump these into scrap_arvix.ipynb to get a clean, duplicate-free list of URLs of individual papers
- get all PDFs (see the sketch after this list)
- scrape the data
- manual quality check
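
A rough sketch of the de-duplication and PDF-download steps, assuming arXiv abstract URLs and the usual `/abs/` to `/pdf/` mapping; the output folder and example URL are made up for illustration:

```python
# Sketch of de-duplicating paper URLs and downloading the PDFs
# (folder name and URL pattern are assumptions; the notebook may differ).
import os
import requests

def dedupe(urls):
    """Remove duplicate URLs while keeping the original order."""
    seen, out = set(), []
    for u in urls:
        if u not in seen:
            seen.add(u)
            out.append(u)
    return out

def download_pdf(abs_url, out_dir="pdfs"):
    """Map an arXiv abstract URL to its PDF URL and save the file."""
    os.makedirs(out_dir, exist_ok=True)
    pdf_url = abs_url.replace("/abs/", "/pdf/")
    name = pdf_url.rstrip("/").split("/")[-1] + ".pdf"
    resp = requests.get(pdf_url, timeout=30)
    resp.raise_for_status()
    with open(os.path.join(out_dir, name), "wb") as f:
        f.write(resp.content)

for url in dedupe(["https://arxiv.org/abs/1706.03762",
                   "https://arxiv.org/abs/1706.03762"]):
    download_pdf(url)
```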
- get all search result URLs and their corresponding number of search results
- dump these into scrap_biorxiv.ipynb to get a clean, duplicate-free list of URLs of individual papers
- now scrape; this step also downloads the PDFs (see the sketch below)
- manual quality check
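
For illustration, a sketch of scraping one bioRxiv paper page and grabbing its PDF in the same pass. The notebooks likely drive Chrome through Selenium; `requests`/`BeautifulSoup` are used here only to keep the sketch short, and the `.full.pdf` suffix is an assumption:

```python
# Sketch only: scrape a bioRxiv abstract page and fetch its PDF in one pass.
import requests
from bs4 import BeautifulSoup

def scrape_biorxiv(paper_url):
    page = requests.get(paper_url, timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")

    h1 = soup.find("h1")
    title = h1.get_text(strip=True) if h1 else None

    # bioRxiv usually serves the PDF at "<paper URL>.full.pdf" (assumption).
    pdf = requests.get(paper_url.rstrip("/") + ".full.pdf", timeout=30)

    return {
        "url": paper_url,
        "title": title,
        "pdf_bytes": pdf.content if pdf.ok else None,
    }
```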
- search PubMed and download the results as .csv files into the raw_result folder
- use scrap_pubmed.ipynb to combine all the CSVs, remove duplicates, and scrape the results (no PDFs); see the sketch below
- save as .csv and do a manual quality check of the search results
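
The combine-and-deduplicate step might look roughly like this with pandas; the `raw_result` folder comes from the step above, while the output file name is arbitrary:

```python
# Sketch of merging the exported PubMed CSVs and dropping duplicate rows.
import glob
import pandas as pd

frames = [pd.read_csv(path) for path in glob.glob("raw_result/*.csv")]
combined = pd.concat(frames, ignore_index=True)

# Drop exact duplicate rows; a subset such as the title column could be used
# instead, depending on the PubMed export format.
combined = combined.drop_duplicates()
combined.to_csv("pubmed.csv", index=False)
```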
- get the table-of-contents PDFs from Springer
- 2014 and 2015: get the URLs manually
- 2016 and 2017: the PDFs contain the paper URLs
- run getMiccaiUrls.py to extract the URLs from the PDFs and dump them as a list into a .npy file (see the sketch below)
- read these in scrap_miccai.ipynb and add the hardcoded 2014 and 2015 URLs
- run scrap_miccai.ipynb (no PDFs)
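
A sketch of what getMiccaiUrls.py presumably does: extract URLs from the table-of-contents PDFs with pdfminer and save them as a .npy list. The input folder name, library choice, and URL regex are assumptions:

```python
# Sketch: pull URLs out of the table-of-contents PDFs and store them as .npy.
import re
import glob
import numpy as np
from pdfminer.high_level import extract_text  # pdfminer.six

urls = []
for pdf_path in glob.glob("miccai_toc/*.pdf"):  # folder name is an assumption
    text = extract_text(pdf_path)
    # crude URL pattern; refine if links wrap across lines in the PDF
    urls += re.findall(r"https?://\S+", text)

np.save("miccai_urls.npy", np.array(urls))
# later, in the notebook: urls = np.load("miccai_urls.npy").tolist()
```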
- go to http://ieeexplore.ieee.org/Xplore/home.jsp
- enter keywords and download the .csv; the link to the search will be in the first row
- combine and clean the multiple downloaded CSVs using combine_Ieee.ipynb; this produces a single, duplicate-free ieee.csv
- run scrap_ieee.ipynb: first download as many PDFs as possible, then loop through the CSV and run pdf2txt on each PDF to extract e-mail addresses (see the sketch below)
- manual cleanup
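
The PDF-to-text and e-mail extraction step could be sketched as below. The notebook uses pdf2txt; here pdfminer's Python API is used instead to keep the example self-contained, and the folder name and regex are assumptions:

```python
# Sketch of the pdf-to-text + e-mail extraction step.
import re
import glob
from pdfminer.high_level import extract_text

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def emails_from_pdf(pdf_path):
    """Convert one PDF to text and return the unique e-mail addresses found."""
    text = extract_text(pdf_path)
    return sorted(set(EMAIL_RE.findall(text)))

for path in glob.glob("ieee_pdfs/*.pdf"):  # folder name is an assumption
    print(path, emails_from_pdf(path))
```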