LD clumping doesn't work as intended #3480
These are the variants of this locus. They are all linked together and should be flagged as LD clumped:

test_variants = [
'7_121325410_G_GCACC',
'7_121326736_A_G',
'7_121280773_G_A',
'7_121292393_A_G',
'7_121308853_A_G',
'7_121287383_A_AC',
'7_121308406_G_C',
'7_121305274_T_C',
'7_121301780_C_CGT',
'7_121320217_G_C',
'7_121319081_C_T',
'7_121329915_G_GCT',
'7_121354115_G_T',
'7_121378803_A_T',
'7_121332237_C_CGT'
]

When looking into the clumped credible set dataset, none of them are flagged:

extracted = (
StudyLocus.from_parquet(session, "/Users/dsuveges/project_data/gentropy/credible_set/gwas_catalog_PICSed_curated_associations")
.filter(
f.col('variantId').isin(test_variants) & (f.col('studyId') == study_id)
)
)
print(extracted.df.count())
extracted.df.select('variantId', 'qualityControls', f.size('ldSet').alias('ldSetSize')).show()

Giving:

On the same dataset, if we call clump, some are indeed clumped:

extracted = (
StudyLocus.from_parquet(session, "/Users/dsuveges/project_data/gentropy/credible_set/gwas_catalog_PICSed_curated_associations")
.filter(
f.col('variantId').isin(test_variants) & (f.col('studyId') == study_id)
).clump()
)
print(extracted.df.count())
extracted.df.select('variantId', 'qualityControls', f.size('ldSet').alias('ldSetSize')).show()

Apparently, after applying a filter, calling clump on an already clumped dataset flags additional loci as clumped. This behaviour indicates an inconsistency in the clumping logic.
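
The expectation behind this observation is that clumping should be idempotent: re-running it on already clumped data should not flag anything new. Below is a minimal pure-Python sketch of that invariant on toy rows. `toy_clump`, the row layout, the toy variant names and the flag string are illustrative assumptions, not gentropy's implementation.

```python
from typing import Dict, List

# Hypothetical flag text; the real qualityControls value may differ.
LD_CLUMPED = "LD clumped"


def toy_clump(loci: List[Dict]) -> List[Dict]:
    """Greedy clumping sketch: walking leads from most to least significant,
    each unflagged lead flags every less significant lead found in its LD set."""
    ordered = sorted(loci, key=lambda row: row["pValue"])
    # Copy rows so the input is left untouched.
    result = [dict(row, qualityControls=list(row["qualityControls"])) for row in ordered]
    for i, lead in enumerate(result):
        if LD_CLUMPED in lead["qualityControls"]:
            continue  # in this toy model a clumped lead does not clump others
        for other in result[i + 1:]:
            if other["variantId"] in lead["ldSet"] and LD_CLUMPED not in other["qualityControls"]:
                other["qualityControls"].append(LD_CLUMPED)
    return result


def flags(rows: List[Dict]) -> Dict[str, List[str]]:
    """Map each lead variant to its quality-control flags."""
    return {row["variantId"]: row["qualityControls"] for row in rows}


# Toy data: lead_B sits in the LD set of the more significant lead_A.
loci = [
    {"variantId": "lead_A", "pValue": 1e-120, "ldSet": {"lead_A", "lead_B"}, "qualityControls": []},
    {"variantId": "lead_B", "pValue": 1e-30, "ldSet": {"lead_B", "lead_A"}, "qualityControls": []},
    {"variantId": "lead_C", "pValue": 1e-10, "ldSet": {"lead_C"}, "qualityControls": []},
]

once = toy_clump(loci)
twice = toy_clump(once)
print(flags(once) == flags(twice))  # True: a second pass flags nothing new
```

If the real pipeline behaved like this toy model, the second `clump()` call in the snippet above would leave `qualityControls` unchanged.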
As @d0choa has noticed, there is something off with the LD clumped GWAS associations. On closer inspection, the LD clumped GWAS Catalog curation dataset contains unflagged associations whose credible set includes tag variants with a more significant p-value, even though the LD clumping step was called on the dataset.
Test case:
This shows there are ~600 top hits in this study/chromosome, many of them flagged as LD clumped. When applying the following filter:
we get 206 associations. This suggests that there are roughly 200 independent top hits on this chromosome, implying that the LD sets of these associations do not overlap. However, this is not true. The most significant association of this region (7_121320217_G_C, with p-value 1e-120) can be found in the LD sets of 15 otherwise unflagged associations. This number should be zero.
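
The check described above can be phrased as a count of unflagged associations whose LD set contains a more significant lead from the same study. The sketch below does this in plain Python on toy rows. The function name, flag string, toy identifiers and row layout are assumptions made for illustration. The real dataset would need the equivalent Spark logic.

```python
from typing import Dict, List, Tuple

# Hypothetical flag text; the real qualityControls value may differ.
LD_CLUMPED = "LD clumped"


def count_missed_clumps(loci: List[Dict]) -> int:
    """Count unflagged leads whose LD set holds a more significant lead of the same study."""
    # Best (lowest) p-value observed for each lead variant within a study.
    best_p: Dict[Tuple[str, str], float] = {}
    for row in loci:
        key = (row["studyId"], row["variantId"])
        best_p[key] = min(row["pValue"], best_p.get(key, float("inf")))

    missed = 0
    for row in loci:
        if LD_CLUMPED in row["qualityControls"]:
            continue  # already flagged, nothing to report
        for tag in row["ldSet"]:
            tag_p = best_p.get((row["studyId"], tag), float("inf"))
            if tag != row["variantId"] and tag_p < row["pValue"]:
                missed += 1
                break
    return missed


# Toy rows: lead_B is unflagged although its LD set contains the stronger lead_A.
rows = [
    {"studyId": "S1", "variantId": "lead_A", "pValue": 1e-120, "ldSet": {"lead_A"}, "qualityControls": []},
    {"studyId": "S1", "variantId": "lead_B", "pValue": 1e-30, "ldSet": {"lead_B", "lead_A"}, "qualityControls": []},
    {"studyId": "S1", "variantId": "lead_C", "pValue": 1e-10, "ldSet": {"lead_C", "lead_A"}, "qualityControls": [LD_CLUMPED]},
    {"studyId": "S1", "variantId": "lead_D", "pValue": 1e-8, "ldSet": {"lead_D"}, "qualityControls": []},
]

print(count_missed_clumps(rows))  # 1 for these toy rows; a consistent clumping gives 0
```

A check along these lines could serve as a regression test: on a correctly clumped dataset the count is expected to be zero.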