LD clumping doesn't work as intended #3480

Closed
DSuveges opened this issue Sep 17, 2024 · 1 comment · Fixed by opentargets/gentropy#772
Labels: Bug, Data, Genetics

@DSuveges

As @d0choa has noticed, something is off with the LD-clumped GWAS associations. On closer inspection, the LD-clumped GWAS Catalog curation dataset contains unflagged associations whose credible sets contain tag variants with more significant p-values, even though the LD clumping step was called on the dataset.

Test case:

from pyspark.sql import functions as f

sl = StudyLocus.from_parquet(session, 'gwas_catalog_PICSed_curated_associations')

# Problematic study and chromosome:
study_id = 'GCST90321118'
chromosome = '7'

(
    sl.df.filter(
        (f.col('studyId') == study_id) & (f.col('chromosome') == chromosome)
    )
    .count()
)

This shows there are ~600 top hits in this study/chromosome, many of them flagged as LD clumped. When applying the following filter:

(
    sl.df.filter(
        (f.col('studyId') == study_id) & (f.col('chromosome') == chromosome) &
        (~f.array_contains(f.col("qualityControls"), 'Explained by a more significant variant in high LD (clumped)'))
    )
    .count()
)

we get 206 associations. This suggests that there are 206 independent top hits on this chromosome, implying that the LD sets of these associations do not overlap. However, this is not true: the most significant association in this region (7_121320217_G_C, p-value 1e-120) appears in the LD sets of 15 otherwise unflagged associations:

top_association = '7_121320217_G_C'

(
    sl.df.filter(
        (f.col('studyId') == study_id) & (f.col('chromosome') == chromosome) &
        # Dropping already flagged associations:
        (~f.array_contains(f.col("qualityControls"), 'Explained by a more significant variant in high LD (clumped)'))
    )
    .filter(
        # Filtering for associations, which are containing the most significant variant as a tag:
        f.array_contains(
            f.transform(
                f.col('ldSet'),
                lambda x: x['tagVariantId']
            ),
            top_association
        )
    )
    .orderBy(f.col('pValueExponent').asc())
    .count()
)

This number should be zero.
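
The query above only checks the single top association. As a minimal sketch, the same invariant can be checked for every pair of leads. This is plain Python over dict rows that mimic the `studyId`, `variantId`, `pValueExponent`, `qualityControls` and `ldSet` fields. The function name `unflagged_overlaps` and the toy rows are made up for illustration, and significance is compared by exponent only, which is a simplification.

```python
CLUMPED_FLAG = "Explained by a more significant variant in high LD (clumped)"


def unflagged_overlaps(rows):
    """Return (lead, stronger_lead) pairs where both are unflagged and
    stronger_lead is a tag variant in the LD set of lead."""
    unflagged = [r for r in rows if CLUMPED_FLAG not in r["qualityControls"]]
    pairs = []
    for row in unflagged:
        tags = {tag["tagVariantId"] for tag in row["ldSet"]}
        for other in unflagged:
            same_study = other["studyId"] == row["studyId"]
            # A more negative exponent means a smaller, more significant p-value.
            stronger = other["pValueExponent"] < row["pValueExponent"]
            if same_study and stronger and other["variantId"] in tags:
                pairs.append((row["variantId"], other["variantId"]))
    return pairs


# Toy rows with invented identifiers (not taken from the real dataset).
toy_rows = [
    {"studyId": "S1", "variantId": "1_100_A_G", "pValueExponent": -120,
     "qualityControls": [],
     "ldSet": [{"tagVariantId": "1_100_A_G"}, {"tagVariantId": "1_200_C_T"}]},
    {"studyId": "S1", "variantId": "1_200_C_T", "pValueExponent": -20,
     "qualityControls": [],
     "ldSet": [{"tagVariantId": "1_200_C_T"}, {"tagVariantId": "1_100_A_G"}]},
    {"studyId": "S1", "variantId": "1_300_G_A", "pValueExponent": -15,
     "qualityControls": [CLUMPED_FLAG], "ldSet": []},
    {"studyId": "S1", "variantId": "1_900_T_C", "pValueExponent": -9,
     "qualityControls": [], "ldSet": [{"tagVariantId": "1_900_T_C"}]},
]

violations = unflagged_overlaps(toy_rows)
```

On correctly clumped data, `violations` would be expected to be empty. In the toy rows it holds one pair, because `1_200_C_T` is unflagged although the stronger `1_100_A_G` sits in its LD set.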

DSuveges added the Data, Gentropy, and Genetics labels Sep 17, 2024
d0choa added the Bug label Sep 17, 2024
DSuveges self-assigned this Sep 18, 2024
@DSuveges

These are the variants of this locus that are all linked together and should be flagged as LD clumped:

test_variants = [
    '7_121325410_G_GCACC',
    '7_121326736_A_G',
    '7_121280773_G_A',
    '7_121292393_A_G',
    '7_121308853_A_G',
    '7_121287383_A_AC',
    '7_121308406_G_C',
    '7_121305274_T_C',
    '7_121301780_C_CGT',
    '7_121320217_G_C',
    '7_121319081_C_T',
    '7_121329915_G_GCT',
    '7_121354115_G_T',
    '7_121378803_A_T',
    '7_121332237_C_CGT'
]

When looking into the clumped credible set dataset, none of them are flagged as clumped:

extracted = (
    StudyLocus.from_parquet(session, "/Users/dsuveges/project_data/gentropy/credible_set/gwas_catalog_PICSed_curated_associations")
    .filter(
        f.col('variantId').isin(test_variants) & (f.col('studyId') == study_id)
    )
)

print(extracted.df.count())
extracted.df.select('variantId', 'qualityControls', f.size('ldSet').alias('ldSetSize')).show()

Giving:

+-------------------+--------------------+---------+
|          variantId|     qualityControls|ldSetSize|
+-------------------+--------------------+---------+
|7_121325410_G_GCACC|                  []|       80|
|    7_121326736_A_G|                  []|       83|
|    7_121280773_G_A|                  []|       93|
|    7_121292393_A_G|                  []|       65|
|    7_121308853_A_G|                  []|       62|
|   7_121287383_A_AC|                  []|       61|
|    7_121308406_G_C|[Palindrome allel...|       64|
|    7_121305274_T_C|                  []|       58|
|  7_121301780_C_CGT|                  []|       42|
|    7_121320217_G_C|[Palindrome allel...|       70|
|    7_121319081_C_T|                  []|       56|
|  7_121329915_G_GCT|                  []|       75|
|    7_121354115_G_T|                  []|       76|
|    7_121378803_A_T|[Palindrome allel...|      124|
|  7_121332237_C_CGT|                  []|       79|
+-------------------+--------------------+---------+

On the same dataset, if we call clump, some of them are indeed clumped:

extracted = (
    StudyLocus.from_parquet(session, "/Users/dsuveges/project_data/gentropy/credible_set/gwas_catalog_PICSed_curated_associations")
    .filter(
        f.col('variantId').isin(test_variants) & (f.col('studyId') == study_id)
    ).clump()
)

print(extracted.df.count())
extracted.df.select('variantId', 'qualityControls', f.size('ldSet').alias('ldSetSize')).show()

Giving:

+-------------------+-----------------------------------------------------------------------------------------------------+---------+
|variantId          |qualityControls                                                                                      |ldSetSize|
+-------------------+-----------------------------------------------------------------------------------------------------+---------+
|7_121301780_C_CGT  |[]                                                                                                   |42       |
|7_121280773_G_A    |[Explained by a more significant variant in high LD (clumped)]                                       |0        |
|7_121319081_C_T    |[]                                                                                                   |56       |
|7_121320217_G_C    |[Palindrome alleles - cannot harmonize]                                                              |70       |
|7_121292393_A_G    |[]                                                                                                   |65       |
|7_121308853_A_G    |[Explained by a more significant variant in high LD (clumped)]                                       |0        |
|7_121287383_A_AC   |[Explained by a more significant variant in high LD (clumped)]                                       |0        |
|7_121305274_T_C    |[Explained by a more significant variant in high LD (clumped)]                                       |0        |
|7_121308406_G_C    |[Palindrome alleles - cannot harmonize, Explained by a more significant variant in high LD (clumped)]|0        |
|7_121326736_A_G    |[]                                                                                                   |83       |
|7_121329915_G_GCT  |[Explained by a more significant variant in high LD (clumped)]                                       |0        |
|7_121325410_G_GCACC|[]                                                                                                   |80       |
|7_121354115_G_T    |[Explained by a more significant variant in high LD (clumped)]                                       |0        |
|7_121378803_A_T    |[Palindrome alleles - cannot harmonize]                                                              |124      |
|7_121332237_C_CGT  |[]                                                                                                   |79       |
+-------------------+-----------------------------------------------------------------------------------------------------+---------+

Apparently, after applying a filter, calling clump on an already clumped dataset yields newly clumped loci. This behaviour indicates an inconsistency in the clumping logic.
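
To make the expected behaviour concrete, here is a minimal sketch of one possible clumping rule in plain Python. It is not gentropy's implementation: the `Lead` class, the `clump` and `flagged_ids` helpers and the toy identifiers are all hypothetical. The assumed rule is that a lead is flagged whenever any more significant lead, flagged or not, appears among its LD tags. A greedy rule that ignores already flagged leads may behave differently on subsets.

```python
from dataclasses import dataclass, field

CLUMPED = "Explained by a more significant variant in high LD (clumped)"


@dataclass
class Lead:
    variant_id: str
    neg_log_p: float      # larger value means a more significant association
    ld_tags: frozenset    # tag variant IDs in high LD with this lead
    quality_controls: list = field(default_factory=list)


def clump(leads):
    """Return copies of the leads, flagging each one whose LD tags
    contain a more significant lead from the same input."""
    out = []
    for lead in leads:
        explained = any(
            other.neg_log_p > lead.neg_log_p and other.variant_id in lead.ld_tags
            for other in leads
        )
        # Recompute the clumping flag from scratch, keep all other flags.
        qc = [q for q in lead.quality_controls if q != CLUMPED]
        if explained:
            qc.append(CLUMPED)
        out.append(Lead(lead.variant_id, lead.neg_log_p, lead.ld_tags, qc))
    return out


def flagged_ids(leads):
    """IDs of the leads that carry the clumping flag."""
    return {l.variant_id for l in leads if CLUMPED in l.quality_controls}


# Toy leads with invented identifiers (not taken from the real dataset).
toy_leads = [
    Lead("top", 120.0, frozenset({"a", "b"})),
    Lead("a", 50.0, frozenset({"top", "b"})),
    Lead("b", 30.0, frozenset({"top", "a"})),
    Lead("c", 10.0, frozenset()),
]

once = clump(toy_leads)
twice = clump(once)
subset = clump(toy_leads[1:3])  # only "a" and "b"
```

Under this rule, `flagged_ids(once)` and `flagged_ids(twice)` are identical, and everything flagged in `subset` is also flagged in `once`. The output above shows the opposite pattern: variants left unflagged in the full dataset become flagged once clump is called on the filtered subset.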
