Clumping GWAS Catalog top hits #3467

Open
d0choa opened this issue Sep 16, 2024 · 3 comments · May be fixed by opentargets/gentropy#779
Labels
Data Relates to Open Targets data team Gentropy Relates to the genetics ETL

Comments

@d0choa

d0choa commented Sep 16, 2024

GWAS Catalog top hits don't have any clumping strategy. If a GWAS Catalog study reports many associations within a single region/haplotype, we have no way to control for it. This might result in an artificially high number of credible sets produced by PICS.

For example, GCST90321118:

Image

We want to perform clumping on these credible sets, but we still need to scope the technical strategy to implement this.

@d0choa d0choa added Data Relates to Open Targets data team Gentropy Relates to the genetics ETL labels Sep 16, 2024
@addramir

I think the best person to implement this is @DSuveges

@addramir addramir assigned addramir and DSuveges and unassigned addramir Sep 19, 2024
@DSuveges

As a related issue, the large number of credible sets in the PICSed GWAS Catalog curated dataset can be partially explained by a bug in the LD clumping method. However, after fixing that bug, there is still a relatively high number of associations on chromosome 7 of this study:

+---------------------------------------------------------------------------------------------------------------------+------------+----------+---------+--------------+--------------+------------------------------------------------------------------+
|variantId                                                                                                            |studyId     |chromosome|position |pValueMantissa|pValueExponent|qualityControls                                                   |
+---------------------------------------------------------------------------------------------------------------------+------------+----------+---------+--------------+--------------+------------------------------------------------------------------+
|7_37918687_G_A                                                                                                       |GCST90321118|7         |37918687 |1.0           |-8            |[]                                                                |
|7_38060307_C_CT                                                                                                      |GCST90321118|7         |38060307 |5.0           |-18           |[]                                                                |
|7_38109854_T_TA                                                                                                      |GCST90321118|7         |38109854 |6.0           |-12           |[Variant not found in LD reference]                               |
|7_38113261_A_G                                                                                                       |GCST90321118|7         |38113261 |2.0           |-16           |[]                                                                |
|7_96514529_A_AC                                                                                                      |GCST90321118|7         |96514529 |2.0           |-9            |[]                                                                |
|7_121084734_C_T                                                                                                      |GCST90321118|7         |121084734|6.0           |-10           |[]                                                                |
|7_121117073_C_T                                                                                                      |GCST90321118|7         |121117073|2.0           |-11           |[Variant not found in LD reference]                               |
|7_121241062_G_GAATTGGATGGAAAAATAAGCACTTTTGAGGAAGATAATCTTTATTTTGCCATTCAAAAACCAGCATCTCTCCTAAATTTTCTGTTGTTTCTTTTAGCAGTAC|GCST90321118|7         |121241062|1.0           |-34           |[Variant not found in LD reference]                               |
|7_121241063_G_GGATGGAAAAATAAGCACTTTTGAGGAAGATAATCTTTATTTTGCCATTCAAAAACCAGCATCTCT                                     |GCST90321118|7         |121241063|1.0           |-34           |[Variant not found in LD reference]                               |
|7_121241065_C_CATTCAAAAACCAGCATCTCTCCTAAATTTTCTGTTGTTTCTTTTAGCA                                                      |GCST90321118|7         |121241065|1.0           |-34           |[Variant not found in LD reference]                               |
|7_121241065_C_T                                                                                                      |GCST90321118|7         |121241065|1.0           |-34           |[Variant not found in LD reference]                               |
|7_121251832_A_G                                                                                                      |GCST90321118|7         |121251832|3.0           |-32           |[Variant not found in LD reference]                               |
|7_121313702_G_A                                                                                                      |GCST90321118|7         |121313702|1.0           |-14           |[]                                                                |
|7_121320217_G_C                                                                                                      |GCST90321118|7         |121320217|1.0           |-126          |[Palindrome alleles - cannot harmonize]                           |
|7_121325298_C_T                                                                                                      |GCST90321118|7         |121325298|1.0           |-19           |[]                                                                |
|7_121325508_A_G                                                                                                      |GCST90321118|7         |121325508|7.0           |-13           |[]                                                                |
|7_121327159_A_T                                                                                                      |GCST90321118|7         |121327159|3.0           |-25           |[Palindrome alleles - cannot harmonize]                           |
|7_121364935_G_A                                                                                                      |GCST90321118|7         |121364935|6.0           |-15           |[]                                                                |
|7_121373353_T_G                                                                                                      |GCST90321118|7         |121373353|2.0           |-12           |[]                                                                |
|7_121386694_T_C                                                                                                      |GCST90321118|7         |121386694|6.0           |-9            |[LD block does not contain variants at the required R^2 threshold]|
+---------------------------------------------------------------------------------------------------------------------+------------+----------+---------+--------------+--------------+------------------------------------------------------------------+
only showing top 20 rows

Some of these credible sets were included with the flag "lead not found in credible set". However, where this flag was absent, the returned LD sets did not overlap.

This observed behaviour justifies applying an extra, window-based clumping step to the GWAS Catalog curated associations.

@DSuveges

Test dataset showing a StudyLocus dataset before and after window-based clumping:

+-------+----------+--------+--------------+--------------------+--------------+---------+
|studyId|chromosome|position|pValueExponent|        studyLocusId|pValueMantissa|variantId|
+-------+----------+--------+--------------+--------------------+--------------+---------+
|     s1|        c1|       1|            -1|  816176356781534521|           1.0|       v1|
|     s1|        c1|       3|            -3| -206100010007302174|           1.0|       v4|
|     s1|        c1|       2|            -2|-4721564960210010127|           1.0|       v2|
|     s1|        c2|       2|            -2|-2919469633967748933|           1.0|       v3|
|     s3|        c2|       2|            -2| 6166427946174414045|           1.0|       v1|
+-------+----------+--------+--------------+--------------------+--------------+---------+


In [2]: sl.window_based_clumping(3).df.show()
+-------+----------+--------+--------------+--------------------+--------------+---------+------------------------------------------------------------+
|studyId|chromosome|position|pValueExponent|studyLocusId        |pValueMantissa|variantId|qualityControls                                             |
+-------+----------+--------+--------------+--------------------+--------------+---------+------------------------------------------------------------+
|s1     |c1        |1       |-1            |1740131172091600674 |1.0           |v1       |[Explained by a more significant variant in the same window]|
|s1     |c1        |2       |-2            |-6342038754064840370|1.0           |v2       |[Explained by a more significant variant in the same window]|
|s1     |c1        |3       |-3            |-3040002280507636093|1.0           |v4       |[]                                                          |
|s1     |c2        |2       |-2            |8923642814302707841 |1.0           |v3       |[]                                                          |
|s3     |c2        |2       |-2            |-218747710423759089 |1.0           |v1       |[]                                                          |
+-------+----------+--------+--------------+--------------------+--------------+---------+------------------------------------------------------------+

(The studyLocusId changes, but that's fine.)
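
For readers who want to follow the logic on the test dataset, here is a minimal sketch of greedy window-based clumping in plain Python. It is not the gentropy implementation: the function names, the inclusive distance comparison, and the greedy "most significant first" strategy are assumptions made for illustration. Only the column names and the flag text are taken from the tables above.

```python
from __future__ import annotations

import math
from collections import defaultdict

# Flag text copied from the qualityControls column shown above.
FLAG = "Explained by a more significant variant in the same window"


def neg_log_pvalue(mantissa: float, exponent: int) -> float:
    """Return -log10(p) for p = mantissa * 10**exponent (hypothetical helper)."""
    return -(math.log10(mantissa) + exponent)


def window_based_clumping(rows: list[dict], distance: int) -> list[dict]:
    """Flag associations lying within `distance` of a more significant retained lead.

    Assumption: the window is applied per (studyId, chromosome) and the
    distance comparison is inclusive.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["studyId"], row["chromosome"])].append(row)

    out = []
    for group in groups.values():
        lead_positions = []  # positions of associations retained so far
        ranked = sorted(
            group,
            key=lambda r: neg_log_pvalue(r["pValueMantissa"], r["pValueExponent"]),
            reverse=True,  # most significant first
        )
        for row in ranked:
            explained = any(
                abs(row["position"] - pos) <= distance for pos in lead_positions
            )
            if not explained:
                lead_positions.append(row["position"])
            out.append({**row, "qualityControls": [FLAG] if explained else []})
    return out


# The five rows of the test dataset above (studyLocusId omitted).
rows = [
    {"studyId": "s1", "chromosome": "c1", "position": 1, "pValueMantissa": 1.0, "pValueExponent": -1, "variantId": "v1"},
    {"studyId": "s1", "chromosome": "c1", "position": 3, "pValueMantissa": 1.0, "pValueExponent": -3, "variantId": "v4"},
    {"studyId": "s1", "chromosome": "c1", "position": 2, "pValueMantissa": 1.0, "pValueExponent": -2, "variantId": "v2"},
    {"studyId": "s1", "chromosome": "c2", "position": 2, "pValueMantissa": 1.0, "pValueExponent": -2, "variantId": "v3"},
    {"studyId": "s3", "chromosome": "c2", "position": 2, "pValueMantissa": 1.0, "pValueExponent": -2, "variantId": "v1"},
]

result = window_based_clumping(rows, distance=3)
for r in result:
    print(r["studyId"], r["chromosome"], r["variantId"], r["qualityControls"])
```

With a distance of 3, v1 and v2 on s1/c1 are flagged because v4 is more significant and sits in the same window, while the remaining three associations are retained unflagged, matching the second table.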
