Visualize and explore fine-mapped signals and their colocalizations
To see an example of a running instance of Colocus, try: https://amp.colocus.app/.
This repository contains the code for the backend server component of Colocus, as well as a docker compose stack to help deploy it.
A .env file must be created before starting the services via docker compose. There are examples in the envs/ directory for a local deployment (env.local) and for a production one (env.production). Copy one, modify it, and save it to the root of the repository as .env. You will only need to change a few values, such as the Postgres password and the Django secret key.
Inside the .env file, be sure to set DATA_PATH, which points to the location on disk where your data lives. This path must have permissions set such that the container is able to read it. You will need to either make all files world readable, or set their group to GID 1001 with read permission (and execute permission on directories). Alternatively, you could use ACLs. For example:
# Make world readable
chmod o+rX -R /path/to/data
# Alternatively, make all files GID 1001
chgrp -R 1001 /path/to/data
chmod g+rX -R /path/to/data
Now you can start the docker compose stack:
# This will build images and enable docker compose watch (see the docker-compose.override.yml instructions below).
docker compose up -d --build --watch
When the colocus-django container starts, it will begin applying Django migrations and then load the data located at the DATA_PATH specified in your .env file.
In the future, if you wish to start over and load a new dataset, do the following:
# If loading data from a new path:
# Change your `DATA_PATH` in your `.env` file to match the location of your dataset
# Then restart to force docker to remount DATA_PATH inside the container
# It's best to recreate the container and build so that your new migrations are included (if any)
docker compose up -d --build --force-recreate django
# If you're just reloading an existing dataset in the same DATA_PATH as before, you can start here:
docker compose exec db bash -c 'psql -U colocus -c "DROP DATABASE core WITH (FORCE)"'
docker compose exec db bash /docker-entrypoint-initdb.d/init-db.sh
docker compose exec django bash /opt/colocus/bin/entrypoint-migrate-and-load.sh
For debugging a new dataset load:
# Get a shell in container
docker compose exec django bash
# From inside container
source .venv/bin/activate
python3 -m pdb scripts/load_dataset.py /data/your-dataset
Sentry is an error logging and aggregation service. You can sign up for a free account at https://sentry.io/signup/ or download and host it yourself. The system is set up with reasonable defaults, including 404 logging and integration with the WSGI application.
You must set the DSN URL via SENTRY_DSN in your .env file.
We have our own deployment and Terraform instructions for CSG. There is currently one site deployed, for the Accelerating Medicines Partnership (AMP) group.
Colocus requires several pieces of data to function. You will need:
- Marginal association analysis (GWAS, eQTLs)
- Fine-mapping or conditional analysis at loci of interest in your GWAS or eQTL study
- Colocalization analysis results for the signals identified from fine-mapping
- LD that was used to perform the fine-mapping (this is only used for coloring variants by LD on LocusZoom plots and need not be perfect, but should ideally match the fine-mapping LD as closely as possible)
Colocus requires two types of input summary statistics/results. These can come from a GWAS study of one or more traits, or an eQTL study.
- The marginal analysis for one or more traits. This is the analysis where each variant is tested for association with the trait without adjusting for any other variant (only study-specific covariates, if any).
- Conditional analysis, per locus, per "signal". At each locus, the number of independent signals must be identified, either through iterative conditional analysis or a fine-mapping approach such as SuSiE. Each signal is represented by its lead variant. For each lead signal variant, we re-run the association analysis, but adjust for all other lead signal variants in the region by including them in the regression model (see the sketch below). Software such as APEX is capable of doing this automatically.
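The following is an illustrative sketch only, not part of Colocus or of any particular analysis pipeline: it uses simulated genotypes and statsmodels to contrast a marginal test of one variant with a conditional test that adds the other lead signal variants as covariates.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
g_test = rng.binomial(2, 0.3, size=n)              # genotype of the variant being tested
g_other_leads = rng.binomial(2, 0.2, size=(n, 2))  # genotypes of the other lead signal variants
y = 0.2 * g_test + 0.1 * g_other_leads[:, 0] + rng.normal(size=n)

# Marginal analysis: the variant alone (plus study-specific covariates, omitted here).
marginal = sm.OLS(y, sm.add_constant(g_test)).fit()

# Conditional analysis: the variant plus all other lead signal variants in the region.
conditional = sm.OLS(y, sm.add_constant(np.column_stack([g_test, g_other_leads]))).fit()

print("marginal beta:", marginal.params[1])
print("conditional beta:", conditional.params[1])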
Both analyses end up with roughly the same type of output, though the conditional analysis is done per signal. The association results are tab-delimited files with the usual columns:
- Variant ID / chrom / pos / reference (ref) allele / alternate (alt) allele
- Association p-value (ideally -log10 p-value)
- Effect size (beta or odds ratio), oriented towards the alternate allele
- Standard error of effect size
- Alternate allele frequency
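As a rough illustration, raw association results could be reshaped into this column layout with pandas. This is a hedged sketch only: the input file name and its column names are hypothetical, not a format Colocus defines.
import numpy as np
import pandas as pd

raw = pd.read_csv("raw_sumstats.tsv", sep="\t")  # hypothetical raw input

harmonized = pd.DataFrame({
    "#chrom": raw["chromosome"].astype(str),
    "pos": raw["position"],
    "rsid": raw.get("rsid", "."),
    "ref": raw["ref_allele"],
    "alt": raw["alt_allele"],
    # Store -log10 p-values to avoid underflow for extremely small p-values.
    "neg_log_pvalue": -np.log10(raw["pvalue"]),
    "beta": raw["beta"],                 # oriented towards the alternate allele
    "stderr_beta": raw["se"],
    "alt_allele_freq": raw["alt_af"],
})
harmonized = harmonized.sort_values(["#chrom", "pos"])
harmonized.to_csv("summ_stats.harmonized.tsv", sep="\t", index=False)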
Directory structure on disk looks like the following:
marginal
├── <dataset UUID>
│   ├── metadata.parquet
│   ├── <trait UUID>
│   │   ├── metadata.parquet
│   │   ├── signals
│   │   │   └── <signal UUID>
│   │   │       ├── metadata.parquet
│   │   │       ├── results.harmonized.gz
│   │   │       └── results.harmonized.gz.tbi
│   │   ├── summ_stats.harmonized.gz
│   │   └── summ_stats.harmonized.gz.tbi
signals.parquet
coloc
└── coloc.parquet
ld
└── UKBB_GRCh37_ALL
    ├── ld.gz
    ├── ld.gz.tbi
    └── metadata.yml
Each trait has its own directory, which should be named with a unique identifier (UUID). An example metadata.parquet file for a GWAS result contains:
{
"uuid": "gwas_diamante_t2d_eur",
"study": {
"uuid": "DIAMANTE",
"description": "Diabetes Meta-Analysis of Trans-Ethnic Association Studies"
},
"tissue": null,
"ancestry": "EUR",
"publication": {
"authors": "Mahajan et al.",
"journal": "Nature Genetics",
"year": 2022,
"pmid": 35551307,
"doi": null
},
"analysts": null,
"submitter": {
"name": "<last name, first name>",
"abbrev": "<abbrev>",
"email": "<email>",
"orcid": "<orcid>",
"institution": "University of Michigan"
},
"principal_investigators": null,
"genome_build": "GRCh37",
"ld": "UKBB_GRCh37_ALL",
"external_link": "https://diagram-consortium.org/downloads.html",
"analysis_type": "GWAS",
"n_traits": 1,
"n_traits_with_sig": 1
}
The parquet file above was rendered in JSON format to make it easier to read, but note that parquet is a columnar data frame format. The metadata file contains only a single row.
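As an illustration of producing such a file, the following sketch (not part of the repository; pyarrow assumed) writes a single-row metadata.parquet with nested fields. Only a subset of the fields above is included; the remaining fields follow the same pattern.
import pyarrow as pa
import pyarrow.parquet as pq

record = {
    "uuid": "gwas_diamante_t2d_eur",
    "study": {
        "uuid": "DIAMANTE",
        "description": "Diabetes Meta-Analysis of Trans-Ethnic Association Studies",
    },
    "ancestry": "EUR",
    "genome_build": "GRCh37",
    "ld": "UKBB_GRCh37_ALL",
    "analysis_type": "GWAS",
    "n_traits": 1,
    "n_traits_with_sig": 1,
}

# from_pylist infers a struct column for the nested "study" dict and
# produces a table with exactly one row.
table = pa.Table.from_pylist([record])
pq.write_table(table, "metadata.parquet")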
For an eQTL trait, an example metadata.parquet file looks like:
{
"uuid": "eqtl_inspire_islet",
"study": {
"uuid": "INSPIRE",
"description": "INSPIRE islet eQTL meta-analysis consortium"
},
"tissue": "islet",
"ancestry": "EUR",
"publication": {
"authors": "Viñuela et al.",
"pmid": 32999275,
"journal": "Nature Communications",
"year": 2020,
"doi": null
},
"analysts": [
{
"name": "<last name, first name>",
"abbrev": "<initials>",
"email": "<email>",
"orcid": "<orcid>",
"institution": "University of Michigan"
}
],
"submitter": {
"name": "<last name, first name>",
"abbrev": "<initials>",
"email": "<email>",
"orcid": "<orcid>",
"institution": "University of Michigan"
},
"principal_investigators": null,
"genome_build": "GRCh37",
"ld": "UKBB_GRCh37_ALL",
"external_link": null,
"analysis_type": "eQTL",
"n_traits": 236,
"n_traits_with_sig": 236
}
The file summ_stats.harmonized.gz contains the marginal association results for the trait. It looks like the following:
#chrom pos rsid ref alt neg_log_pvalue beta stderr_beta alt_allele_freq
1 79033 . A G 0.509 -0.19 0.19 0.999
<additional rows>
The file must be bgzipped and tabix indexed.
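For example, a sorted, tab-delimited file can be bgzipped and indexed from Python with pysam (assumed to be installed); the equivalent bgzip and tabix command-line tools work just as well. The input file name below is a placeholder.
import pysam

# Compress with bgzip (plain gzip will not work with tabix).
pysam.tabix_compress("summ_stats.harmonized.tsv", "summ_stats.harmonized.gz", force=True)
# Index on chromosome (column 1) and position (column 2); pysam uses 0-based
# column indices, and the '#'-prefixed header line is skipped by default.
pysam.tabix_index("summ_stats.harmonized.gz", seq_col=0, start_col=1, end_col=1, force=True)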
Underneath each trait is a signals directory, which contains one subdirectory per signal. Each signal subdirectory should be named with a unique identifier (UUID). This UUID must be unique across all signals for all traits.
The metadata.parquet file for a signal looks like the following:
{
"uuid": "Lc7hEWyp24Nco8j97GXrfr",
"lead_variant": {
"chrom": "9",
"pos": 136241189,
"ref": "C",
"alt": "T"
},
"neg_log_p": 51.647,
"effect_cond": -15.196,
"se_cond": 0.998,
"effect_marg": -0.307,
"is_marg": false,
"cs_variants": [
"9_136218590_C_A",
"9_136238509_G_A",
"9_136241189_C_T",
"9_136241639_C_T",
"9_136249929_G_A",
"9_136264493_C_T",
"9_136267371_G_T"
],
"cs_alpha": [
0.0665931,
0.037789,
0.4947666,
0.0755713,
0.1302856,
0.1038847,
0.0618499
],
"finemap_program": {
"name": "susieR",
"version": "v0.0.0"
}
}
The fields above are mostly self-explanatory. Some require a bit of clarification:
- neg_log_p: The p-value (as -log10 p) from conditional analysis or fine-mapping.
- effect_cond and effect_marg: The effect size from the conditional analysis or fine-mapping, and the marginal effect size, respectively.
- is_marg: Denotes whether this particular signal was taken directly from the marginal association statistics. Sometimes no fine-mapping is done at a particular locus, for example when there is only a single association signal, or when fine-mapping fails.
- cs_alpha: Only present if the fine-mapping was done with SuSiE. In that case, the alphas are the posterior inclusion probabilities (conditional on the signal) for each credible set variant. The marginal inclusion probabilities can be found in the results.harmonized.gz file.
- cs_variants: The list of each variant in the credible set. The values in cs_alpha correspond in order to the variants in this list.
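A quick way to sanity-check a signal's metadata.parquet (a sketch only, assuming pandas with the pyarrow engine) is to confirm that cs_alpha parallels cs_variants and that the lead variant appears in the credible set:
import pandas as pd

# Read the single-row signal metadata and check credible set consistency.
sig = pd.read_parquet("metadata.parquet").iloc[0]
lead = sig["lead_variant"]
lead_id = f"{lead['chrom']}_{lead['pos']}_{lead['ref']}_{lead['alt']}"

assert len(sig["cs_variants"]) == len(sig["cs_alpha"]), "cs_alpha must parallel cs_variants"
assert lead_id in list(sig["cs_variants"]), "lead variant should appear in cs_variants"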
In each signal directory is the conditional association results file results.harmonized.gz for that signal. It is
identical in format to the summ_stats.harmonized.gz file.
There is also a master signals.parquet file that contains information about all signals across all datasets. This file is NOT REQUIRED; however, it may be useful for debugging. Each record in the file looks like the following (rendered as JSON for easier reading):
{
"sig_uuid": "UXPTfTuQtfikGyjD2hHKmh",
"study_uuid": "gwas_diamante_t2d_eur",
"lead_variant": "4_1784403_C_T",
"susie_idx": 1,
"susie_cs": 1,
"susie_cs_variants": [
"4_1784403_C_T",
"4_1784605_G_C"
],
"susie_cs_alpha": [
0.7572763,
0.1992575
],
"path": "data/orig/muscislet/t2d_gwas_susie/diamante_T2D-European__MAEA__rs56337234__P__chr4-1534402-2034403__250kb.selected.Rda",
"extract_marginal": false,
"study_type": "GWAS",
"trait": "T2D",
"gene": null,
"exon": null,
"feature": "T2D",
"tissue": null,
"cell_type": null,
"trust_alleles": true,
"finemap_program": "susieR",
"finemap_version": "v0.0.0",
"genome_build": "GRCh37"
}
Fields that require some explanation:
- susie_idx: If this record is for a fine-mapped signal that came from SuSiE, this field contains the row index of the SuSiE matrices (such as alpha, mu, mu2, lbf_variable, etc.) to extract.
- susie_cs: The index into the SuSiE credible sets list.
- extract_marginal: Same as extract_marg in the individual metadata files for each signal. Denotes whether this signal was extracted from the marginal association data, perhaps because fine-mapping was not run or failed.
- trust_alleles: Denotes whether we can trust the alleles provided by the study. If we cannot trust the alleles, they have been remapped using dbSNP and/or the LD reference to identify which allele is the ref and which is the alt.
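Since this master file is only for debugging, something as simple as the following pandas sketch (the file path is assumed to be the top-level signals.parquet) can be used to eyeball it:
import pandas as pd

# Inspect the optional master signals file: count signals per study and
# preview a few key columns.
signals = pd.read_parquet("signals.parquet")
print(signals.groupby("study_uuid")["sig_uuid"].count())
print(signals[["sig_uuid", "study_uuid", "lead_variant", "trait", "genome_build"]].head())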
Each lead variant for all signals must have LD calculated between it and all other variants in the region (up to any distance cutoff, but typically 1 Mb).
Ideally, LD would be calculated either in the original samples used for the GWAS/eQTL study, or in a large reference panel (one that matches the original population as closely as possible, and ideally the same one that was used when performing fine-mapping if a reference panel was used in that analysis).
LD information is stored on disk in the following format:
ld/
└── <ld-panel-uuid>
    ├── ld.gz
    ├── ld.gz.tbi
    └── metadata.yml
The ld-panel-uuid is a unique identifier for the LD panel used to calculate LD. Within each directory, there is a
metadata.yml file that provides information about the panel. As an example:
uuid: 'UKBB_GRCh37_ALL'
panel: 'UKBB'
population: 'ALL'
genome_build: 'GRCh37'
The ld.gz file contains the calculated LD information, concatenated together and bgzipped. It is a tab-delimited
file with the following format:
1 1242707 1:1242707_A/G 1 742813 1:742813_C/T 0.000561357
1 1242707 1:1242707_A/G 1 742825 1:742825_A/G 0.000542286
1 1242707 1:1242707_A/G 1 742832 1:742832_C/A 0.000123633
Columns 1, 2, 3 pertain to the lead variant. They are the chromosome, position, and variant ID respectively.
Columns 4, 5, 6 pertain to the "other" variants in the region. They are also the chromosome, position, and variant ID.
The final column is the LD (r2) value between the lead variant and the other variant.
This file should be sorted by chromosome and position for the lead variants (columns 1-3), and then bgzipped and tabix indexed.
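One way to produce that layout is sketched below, assuming pandas and pysam are available; the column names are placeholders for the seven columns described above, and the input file name is hypothetical.
import pandas as pd
import pysam

cols = ["lead_chrom", "lead_pos", "lead_id", "other_chrom", "other_pos", "other_id", "r2"]
ld = pd.read_csv("ld.unsorted.tsv", sep="\t", header=None, names=cols,
                 dtype={"lead_chrom": str, "other_chrom": str})

# Sort on the lead variant's chromosome and position (columns 1-2).
ld = ld.sort_values(["lead_chrom", "lead_pos"])
ld.to_csv("ld.tsv", sep="\t", header=False, index=False)

# bgzip and index on the lead variant's chromosome/position (0-based column indices in pysam).
pysam.tabix_compress("ld.tsv", "ld.gz", force=True)
pysam.tabix_index("ld.gz", seq_col=0, start_col=1, end_col=1, force=True)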
Colocalization results are stored on disk in a single file:
coloc
└── coloc.parquet
Each colocalization result for a pair of signals is given its own unique UUID. These UUIDs must be unique across all colocalization results for all traits.
Each record in the coloc.parquet file has the following information:
{
"uuid": "7yCjsigpyW9AWVgcqM7SkF",
"signal1": "918mCCrkd6US8F8qbyLDoi",
"signal2": "EuaY2JFmhPH7fAUEduozHz",
"coloc_h3": 0.904439412729006,
"coloc_h4": 0.0034383473563195,
"dataset1": "gwas_diamante_t2d_eur",
"dataset2": "eqtl_inspire_islet",
"trait1": "T2D",
"trait2": "ENSG00000114770_183645118_183645231",
"trait2_symb": null,
"trait1_variant": "3_183738626_T_A",
"trait2_variant": "3_183683124_G_A",
"cross_signal": {
"effect": [
[
-0.036,
0.0468
],
[
-0.018,
0.35
]
],
"se": [
[
0.0063,
0.0322
],
[
0.013,
0.0667
]
],
"log_pval": [
[
7.7,
0.835
],
[
0.796,
6.59
]
]
},
"r2": 0.0564906,
"n_coloc_between_traits": 0
}
The fields are:
- uuid: unique identifier for this colocalization result
- signal1: uuid of the first signal
- signal2: uuid of the second signal
- dataset1: first signal's study uuid
- dataset2: second signal's study uuid
- trait1: uuid of the first trait
- trait2: uuid of the second trait
- trait2_symb: gene symbol of trait 2 if applicable
- trait1_variant: lead variant for the first signal
- trait2_variant: lead variant for the second signal
- coloc_h3: Posterior probability of H3 from coloc
- coloc_h4: Posterior probability of H4 from coloc
- cross_signal: cross signal information; each subfield 'effect', 'se', etc. is a list of lists (a matrix), where:
- row 1 is the first trait's variant,
- row 2 is the second trait's variant,
- col 1 is the first trait's data (effect in marginal summary statistics)
- col 2 is the second trait's data (effect in marginal summary statistics)
- n_coloc_between_traits: number of total colocalizations found between the two traits; this is used in the web UI
- r2: the LD between trait1_variant and trait2_variant
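To illustrate how these records might be consumed (a sketch assuming pandas with the pyarrow engine; the 0.8 H4 cutoff is just an example threshold):
import pandas as pd

# Read the colocalization results and pull out high-H4 pairs.
coloc = pd.read_parquet("coloc/coloc.parquet")
strong = coloc[coloc["coloc_h4"] > 0.8]
print(strong[["uuid", "trait1", "trait2", "coloc_h4", "r2"]].head())

# Index the cross_signal matrix: rows are the two lead variants, columns the two traits.
row = coloc.iloc[0]
effect = row["cross_signal"]["effect"]
# effect[0][0]: trait1_variant's effect in trait1's marginal summary statistics
# effect[0][1]: trait1_variant's effect in trait2's marginal summary statistics
# effect[1][1]: trait2_variant's effect in trait2's marginal summary statistics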
More information on the required types of data can be found above under required data.
There is an example test dataset in colocus/tests/data/ that can be used to test loading data. It is the same dataset used for running test cases.
Make a docker-compose.override.yml that enables watching files and rebuilding container images as needed:
While developing, you may want the containers to rebuild or resync automatically as your source files change. The following can be placed in a docker-compose.override.yml file:
services:
  django:
    build: .
    command: --reload
    develop:
      watch:
        - action: sync
          path: ./colocus
          target: /opt/colocus/colocus
        - action: sync
          path: ./colocus/core/migrations
          target: /opt/colocus/colocus/core/migrations
        - action: sync
          path: ./scripts
          target: /opt/colocus/scripts
        - action: rebuild
          path: ./Dockerfile
  ui:
    build:
      context: ../colocus-ui-vue3
      dockerfile: Dockerfile.dev
    ports:
      - "${VITE_PORT}:${VITE_PORT}"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:${VITE_PORT}"]
      interval: 5s
      timeout: 5s
      retries: 5
    develop:
      watch:
        - action: rebuild
          path: ../colocus-ui-vue3/package.json
        - action: rebuild
          path: ../colocus-ui-vue3/vite.config.mjs
        - action: sync
          path: ../colocus-ui-vue3/src
          target: /app/src
        - action: sync
          path: ../colocus-ui-vue3/etc
          target: /app/etc
        - action: rebuild
          path: ../colocus-ui-vue3/Dockerfile.dev
This assumes you have the colocus and colocus-ui-vue3 repositories checked out and next to each other in the directory hierarchy.
For example, your directory tree should look like this:
root
| - colocus-ui-vue3
| - colocus
You'll also want the following in your .env file:
VITE_HOST=0.0.0.0
VITE_PORT=5173
VITE_API_URL=http://django:${UVICORN_PORT}
Now you can run:
docker compose up --build --watch
This will bring up the containers, building them as needed, and watch for source code changes. If any source files change, the container image will be rebuilt if necessary and restarted. In the case of the Django container, it will instead sync the new source files into the container, and the uvicorn server will then reload them.
To run code checks, you will need to install uv first, and then the required packages:
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install packages
uv sync
The project is set up to use pre-commit to run all checks at once. You can either install the pre-commit git hooks, or run pre-commit yourself manually before committing.
To run pre-commit manually:
uv run pre-commit run --all-files -v
This is the same command our GitHub Actions CI will run when you push a commit.
To run the test suite via docker compose:
bin/compose-run-pytest.sh