Knowledge graph analytics platform (KGAP) for LINCS and IDG Common Fund dataset integration. The initial biomedical application is Parkinson's disease drug target discovery, in alignment with the IDSL PRIDE initiative (Parkinson's Research through Integrative Data Experiments).
See also:
- KGAP Project Homepage (includes db dumpfile)
- KGAP LIDIA (LINCS-IDG Drug-target Illumination Application)
Publication LINK: "Knowledge graph analytics platform with LINCS and IDG for Parkinson's disease target illumination", by Jeremy Yang, Christopher Gessner, Joel Duerksen, Daniel Biber, Jessica Binder, Brian Foote, Robin McEntire, Kyle Stirling, Ying Ding and David Wild, BMC Bioinformatics, 2022.
- Neo4j Desktop Version 4.2 or newer (select version 4.2 when creating database, and plugin versions are automatically determined by Neo4j version)
- 8GB+ ram
- 5GB+ free disk space
- datasets listed below
- Knime Version 4.3 (Knime will notify for automatic download any nodes required but not yet installed)
- ability to run, load from dump and host temporarily Postgres 10.11 database
- Docker version 20.10.1 or newer, if running the example script
Note: These steps should work with little or no modifications on Macintosh, Unix, and Windows. Process was tested on a MacBookPro 2017 running MacOS Catalina
To restore this graph database from a dump file download this dump file,
install the Neo4j Desktop Client Neo4j and launch it. Create a new 4.2 database (do not start the database). Click on the ... and then click Manage in the pop up menu, and then click on Open Terminal. Restore the database from the dump file with the command, changing PATHTOFILE
as needed.
bin/neo4j-admin load --from=PATHTOFILE/dclneodb.dump --database=neo4j
Next exit the terminal and click Start to start the database. Click Open to use built in Neo4j browser to query and explore the database.
Note: Developed and used on Ubuntu 20.04, and then these directions were tested on a MacbookPro 2017 running MacOS Catalina.
- Clone this repository in your home directory, the following process assumes ~/kgap_lincs-idg/ exists and is populated
- Download, restore and bring online drugcentral_lincs
- Download and install Knime Note: The workflow was created with version 4.3
- Import the knime workflow kgap_lincs-idg/opt1_step1_create_neo4j_files, you will see and safely ignore this dialog message.
- This workflow extracts and transforms data from three datasets, the file tcrd_targets.tsv in this repository, the online DrugCentral database, and the drugcentral_lincs database (loaded above).
- In the knime workflow find the drugcentral_lincs PostgresSQL Connector and change the address/port to point to the drugcentral_lincs database where you are hosting it.
- In Knime "Excecute all executable nodes" for this workflow, the relationship and node files (and the header files) will be stored in opt1_step2_docker_neo4j_load/e and opt1_step2_docker_neo4j_load/v respectively. The script opt1_step2_docker_neo4j_load/clean_ev is a utility script to delete the relationship and node files.
For a full view of the workflow see KGAP-ETL_KNIME_workflow.png.
- An example script to load in neo4j community server running in docker are provided in opt1_step2_docker_neo4j_load
Searching for disease associated genes using KGAP.
- Python 3.7+
- Python packages
neo4j
,sklearn
,psycopg2
,matplotlib
,pandas
- Local instance of KGAP_LINCS-IDG graph database as described above.
The command-line program kgap_analysis.py
may be used to query the database with a disease term, generate associated, scored genes, and perform a ROC validation analysis vs known genes. Credentials
for the public instance of DrugCentral are included as defaults, while the Neo4j credentials
must be valid for your local instance.
kgap_analysis.py -h
usage: kgap_analysis.py [-h] --indication_query INDICATION_QUERY
[--atc_query ATC_QUERY] [--odir ODIR]
[--dc_dbhost DC_DBHOST] [--dc_dbport DC_DBPORT]
[--dc_dbname DC_DBNAME] [--dc_dbusr DC_DBUSR]
[--dc_dbpw DC_DBPW] [--neo4j_uri NEO4J_URI]
[--neo4j_paramfile NEO4J_PARAMFILE] [-v]
KGAP LINCS-IDG search and ROC analysis
optional arguments:
-h, --help show this help message and exit
--indication_query INDICATION_QUERY
DrugCentral indication query
--atc_query ATC_QUERY
DrugCentral ATC L1 query
--odir ODIR output dir
--dc_dbhost DC_DBHOST
DrugCentral DBHOST
--dc_dbport DC_DBPORT
DrugCentral DBPORT
--dc_dbname DC_DBNAME
DrugCentral DBNAME
--dc_dbusr DC_DBUSR DrugCentral DBUSR
--dc_dbpw DC_DBPW DrugCentral DBPW
--neo4j_uri NEO4J_URI
Neo4j DB URI
--neo4j_paramfile NEO4J_PARAMFILE
Neo4j parameter file
-v, --verbose
The specific command used to generate the PD results in our paper is as follows.
kgap_analysis.py --indication_query "Parkinson" --atc_query "NERVOUS SYSTEM"