scrap-science

A bunch of Jupyter notebooks to scrape some of the most popular web platforms for scientific papers.

Setup

  • Python environment
  • Chrome driver
  • pdf2txt (a quick smoke test of the setup follows below)
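
The notebooks assume these pieces are in place. Here is a minimal smoke test, assuming Selenium drives Chrome and pdf2txt is pdfminer's command-line tool; the repo does not pin packages, so treat both as assumptions:

```python
import subprocess

from selenium import webdriver

# Headless Chrome via chromedriver, which must be on the PATH.
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://arxiv.org")
print(driver.title)  # prints the landing-page title if the driver works
driver.quit()

# pdf2txt.py ships with pdfminer; "paper.pdf" is a placeholder file name.
out = subprocess.run(["pdf2txt.py", "paper.pdf"], capture_output=True, text=True)
print(out.stdout[:200])
```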

arXiv

  1. Get all search-result URLs and their corresponding number of search results.
  2. Dump these into scrap_arvix.ipynb and get a clean, duplicate-free list of URLs of individual papers.
  3. Get all PDFs (steps 1-3 are sketched below).
  4. Scrape the data.
  5. Manual quality check.
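
A hedged sketch of steps 1-3, using requests and BeautifulSoup rather than whatever scrap_arvix.ipynb actually does; `search_urls` is a placeholder for the URLs collected in step 1, and the /abs/-to-/pdf/ rewrite relies on arXiv's usual URL scheme:

```python
import requests
from bs4 import BeautifulSoup

search_urls = ["https://arxiv.org/list/cs.CV/2017"]  # placeholder

paper_urls = set()  # a set keeps the list duplicate-free
for url in search_urls:
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for a in soup.find_all("a", href=True):
        if "/abs/" in a["href"]:  # abstract pages identify individual papers
            paper_urls.add("https://arxiv.org" + a["href"])

# Step 3: arXiv serves the PDF at the /pdf/ counterpart of each /abs/ URL.
for url in sorted(paper_urls):
    pdf = requests.get(url.replace("/abs/", "/pdf/") + ".pdf")
    with open(url.rsplit("/", 1)[-1] + ".pdf", "wb") as f:
        f.write(pdf.content)
```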

bioRxiv

  1. Get all search-result URLs and their corresponding number of search results.
  2. Dump these into scrap_biorxiv.ipynb and get a clean, duplicate-free list of URLs of individual papers.
  3. Scrape; this step fetches the PDFs too (see the sketch below).
  4. Manual quality check.
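
A sketch of steps 2-3 under the same assumptions as above (requests + BeautifulSoup, not necessarily what scrap_biorxiv.ipynb does); bioRxiv abstract pages live under /content/, and the PDF is conventionally the same URL with ".full.pdf" appended, which is an assumption worth verifying:

```python
import requests
from bs4 import BeautifulSoup

search_urls = ["https://www.biorxiv.org/search/deep%20learning"]  # placeholder

paper_urls = set()
for url in search_urls:
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for a in soup.find_all("a", href=True):
        if a["href"].startswith("/content/10."):  # DOI-based paper pages
            paper_urls.add("https://www.biorxiv.org" + a["href"])

# Step 3: scrape each paper page and grab its PDF in the same pass.
for url in sorted(paper_urls):
    with open(url.rsplit("/", 1)[-1] + ".pdf", "wb") as f:
        f.write(requests.get(url + ".full.pdf").content)
```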

PubMed

  1. Search PubMed and download the results as .csv files into the raw_result folder.
  2. Use scrap_pubmed.ipynb to combine all the CSVs, remove duplicates, and finally scrape them (no PDFs; the merge step is sketched below).
  3. Save as .csv and do a manual quality check of the search results.
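
A minimal sketch of step 2's CSV merging with pandas; the raw_result folder name comes from step 1, while the PMID column name is an assumption about PubMed's export format:

```python
import glob

import pandas as pd

frames = [pd.read_csv(path) for path in glob.glob("raw_result/*.csv")]
combined = pd.concat(frames, ignore_index=True)

# Deduplicate on the PubMed ID if that column exists, otherwise on full rows.
key = ["PMID"] if "PMID" in combined.columns else None
combined = combined.drop_duplicates(subset=key)
combined.to_csv("pubmed.csv", index=False)
```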

MICCAI

  1. Get the contents PDFs from Springer:
     • 2014 and 2015 - get the URLs manually
     • 2016 and 2017 - the PDFs contain the URLs
  2. Run getMiccaiUrls.py to extract the URLs from the PDFs and dump them in a list saved as a .npy file (sketched below).
  3. Read these in scrap_miccai.ipynb and add the URLs hardcoded from 2014 and 2015.
  4. Run scrap_miccai.ipynb (no PDFs).
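
A sketch of what getMiccaiUrls.py presumably does: pull URLs out of the 2016/2017 contents PDFs with pdf2txt from the setup and save them as a .npy list. The input file names here are placeholders:

```python
import re
import subprocess

import numpy as np

urls = []
for pdf in ["miccai2016_contents.pdf", "miccai2017_contents.pdf"]:  # placeholders
    # Convert the PDF to text, then grep for anything that looks like a URL.
    text = subprocess.run(["pdf2txt.py", pdf], capture_output=True, text=True).stdout
    urls += re.findall(r"https?://\S+", text)

np.save("miccai_urls.npy", np.array(sorted(set(urls))))
```

scrap_miccai.ipynb can then read the list back with np.load("miccai_urls.npy") before appending the hardcoded 2014/2015 URLs.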

IEEE

  1. Go to http://ieeexplore.ieee.org/Xplore/home.jsp.
  2. Enter keywords and download the .csv; the link to the search will be in its first row.
  3. Combine and clean the multiple downloaded CSVs using combine_Ieee.ipynb. This produces a single, duplicate-free ieee.csv.
  4. Run scrap_ieee.ipynb: it first downloads as many PDFs as it can, then loops through the CSV and converts the PDFs with pdf2txt to get emails (steps 3-4 are sketched below).
  5. Manual cleanup.
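
A sketch of steps 3-4: merge the downloaded CSVs, then harvest email addresses from the PDF text. combine_Ieee.ipynb's exact logic may differ; the folder names are assumptions, and skipping the first row follows from the search link sitting there per step 2:

```python
import glob
import re
import subprocess

import pandas as pd

# Step 3: one duplicate-free ieee.csv from all downloaded result files.
combined = pd.concat(
    [pd.read_csv(p, skiprows=1) for p in glob.glob("ieee_raw/*.csv")],
    ignore_index=True,
).drop_duplicates()
combined.to_csv("ieee.csv", index=False)

# Step 4: convert each downloaded PDF to text and grep for email addresses.
emails = set()
for pdf in glob.glob("ieee_pdfs/*.pdf"):
    text = subprocess.run(["pdf2txt.py", pdf], capture_output=True, text=True).stdout
    emails.update(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text))
print(sorted(emails))
```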
