IntelCompH2020/nihmporter

nihmporter is Python software to download and pack the data published by the National Institutes of Health (NIH).

Installation

You can use make_conda_environment.sh to build a proper Anaconda environment (by default, named nih), or inspect it to see the exact requirements.

Usage

Activate the above environment and run

# after activating the appropriate conda environment
./import.py

It should produce some feather or pickle files (as of July 2021, huge feather files cause memory issues), each one storing a Pandas DataFrame. In any one of them, the same record may, and most likely will, show up more than once: until its final release, the information on a contract is updated in different files at successive dates, and import.py stitches those files together. For more details, see the About section.
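Since the same record can appear several times, a downstream step may want to keep only the most recent version of each one. The following is a minimal sketch of that idea, not part of nihmporter itself. The column names `record_id` and `snapshot_date` are hypothetical placeholders; inspect the actual DataFrame to find the columns that identify a record and the date of each update.

```python
import pandas as pd


def keep_latest(df, key="record_id", date_col="snapshot_date"):
    """Keep, for each value of `key`, only the row with the latest `date_col`.

    Both column names are hypothetical; replace them with the real ones.
    """
    return (
        df.sort_values(date_col, kind="stable")   # oldest first
          .drop_duplicates(subset=key, keep="last")  # keep the newest row per key
          .reset_index(drop=True)
    )


# Toy data standing in for one of the DataFrames written by import.py
df = pd.DataFrame({
    "record_id": [1, 1, 2],
    "snapshot_date": pd.to_datetime(["2021-01-01", "2021-06-01", "2021-03-01"]),
    "amount": [100, 150, 80],
})
latest = keep_latest(df)
```

In this toy example record 1 appears twice, and only its later row (the one dated 2021-06-01) survives.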

The script also produces several CSV files, which subset the above feather/pickle files into the data used by the (extra) utility connectivity_stats.py.

Re-runs

If the script is re-run (in the same directory), many already existing files will be reused (i.e., not downloaded again). In particular, whenever the program is about to download a zip file, it only does so if the file is not already present locally, or if the file of the same name on the server is more recent (in which case the local file is overwritten).
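The rule above (download only when the local file is missing or older than the one on the server) can be sketched as follows. This is an illustration, not the code used by import.py. It assumes the server reports a `Last-Modified` HTTP header, and the function names are made up for the example.

```python
import os
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen, urlretrieve


def needs_download(local_path, remote_modified):
    """Return True if the local file is missing or older than the remote one."""
    if not os.path.exists(local_path):
        return True
    local_modified = datetime.fromtimestamp(
        os.path.getmtime(local_path), tz=timezone.utc
    )
    return remote_modified > local_modified


def remote_last_modified(url):
    """Read Last-Modified through a HEAD request.

    Assumption: the server sends this header; if it does not, this raises.
    """
    with urlopen(Request(url, method="HEAD")) as response:
        return parsedate_to_datetime(response.headers["Last-Modified"])


def fetch_if_stale(url, local_path):
    """Download `url` to `local_path` only when needed; return True if downloaded."""
    if needs_download(local_path, remote_last_modified(url)):
        urlretrieve(url, local_path)  # overwrites any existing local copy
        return True
    return False
```

Keeping the comparison in `needs_download` separate from the network calls makes the decision easy to check without contacting any server.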

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 101004870. H2020-SC6-GOVERNANCE-2018-2019-2020 / H2020-SC6-GOVERNANCE-2020