LD clumping doesn't work as intended #3480

DSuveges · 2024-09-17T09:59:42Z

As @d0choa has noticed, there is something off with the LD clumped GWAS associations. On closer inspection, the LD clumped GWAS Catalog curation dataset contains unflagged associations whose credible set includes tag variants with a more significant p-value, even though the LD clumping step is called on the dataset.

Test case:

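from pyspark.sql import functions as f
from gentropy.dataset.study_locus import StudyLocus

# `session` is assumed to be an already initialised gentropy Session.
sl = StudyLocus.from_parquet(session, 'gwas_catalog_PICSed_curated_associations')

# Problematic study and chromosome:
study_id = 'GCST90321118'
chromosome = '7'

(
    # Both conditions have to sit inside the same filter() call:
    sl.df.filter((f.col('studyId') == study_id) & (f.col('chromosome') == chromosome))
    .count()
)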

This shows that there are ~600 top hits in this study/chromosome, many of them flagged as LD clumped. When applying the following filter:

(
    sl.df.filter(
        (f.col('studyId') == study_id) & (f.col('chromosome') == chromosome) &
        (~f.array_contains(f.col("qualityControls"), 'Explained by a more significant variant in high LD (clumped)'))
    )
    .count()
)

we get 206 associations. This suggests that there are roughly 200 independent top hits on this chromosome, implying that the LD sets of these associations do not overlap. However, this is not true. The most significant association in this region (7_121320217_G_C, p-value 1e-120) appears in the LD sets of 15 otherwise unflagged associations:

top_association = '7_121320217_G_C'

(
    sl.df.filter(
        (f.col('studyId') == study_id) & (f.col('chromosome') == chromosome) &
        # Dropping already flagged associations:
        (~f.array_contains(f.col("qualityControls"), 'Explained by a more significant variant in high LD (clumped)'))
    )
    .filter(
        # Filtering for associations, which are containing the most significant variant as a tag:
        f.array_contains(
            f.transform(
                f.col('ldSet'),
                lambda x: x['tagVariantId']
            ),
            top_association
        )
    )
    .orderBy(f.col('pValueExponent').asc())
    .count()
)

This number should be zero.
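The same expectation can be written as a general check rather than for one hand-picked variant. Below is a minimal, Spark-free sketch on toy data: `Association` and `unresolved_overlaps` are hypothetical names introduced only for illustration (they are not part of gentropy), and significance is simplified to a single -log10(p) value.

```python
from typing import NamedTuple


class Association(NamedTuple):
    """Hypothetical, simplified stand-in for one StudyLocus row."""

    variant_id: str
    neg_log_p: float         # -log10(p-value); larger means more significant
    ld_tags: frozenset[str]  # tagVariantId values taken from the ldSet
    clumped: bool            # True if the row already carries the clumping flag


def unresolved_overlaps(rows: list[Association]) -> list[tuple[str, str]]:
    """Return (weaker_lead, stronger_lead) pairs among unflagged rows where the
    stronger lead is listed as a tag in the weaker lead's LD set."""
    unflagged = [row for row in rows if not row.clumped]
    significance = {row.variant_id: row.neg_log_p for row in unflagged}
    pairs = []
    for row in unflagged:
        for tag in row.ld_tags:
            # Tags that are not unflagged leads themselves default to -inf.
            if tag != row.variant_id and significance.get(tag, float("-inf")) > row.neg_log_p:
                pairs.append((row.variant_id, tag))
    return sorted(pairs)


# Toy data, not taken from the real dataset.
rows = [
    Association("lead_A", 120.0, frozenset({"lead_A", "lead_B"}), False),
    Association("lead_B", 30.0, frozenset({"lead_B", "lead_A"}), False),
    Association("lead_C", 12.0, frozenset({"lead_C"}), False),
    Association("lead_D", 9.0, frozenset({"lead_D", "lead_A"}), True),
]

print(unresolved_overlaps(rows))  # [('lead_B', 'lead_A')]
```

On a correctly clumped dataset the returned list would be empty; here `lead_B` is reported because it is unflagged while its LD set contains the stronger, also unflagged `lead_A`.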

DSuveges · 2024-09-18T15:13:27Z

These are the variants of this locus that are all linked together and should be flagged as LD clumped:

test_variants = [
    '7_121325410_G_GCACC',
    '7_121326736_A_G',
    '7_121280773_G_A',
    '7_121292393_A_G',
    '7_121308853_A_G',
    '7_121287383_A_AC',
    '7_121308406_G_C',
    '7_121305274_T_C',
    '7_121301780_C_CGT',
    '7_121320217_G_C',
    '7_121319081_C_T',
    '7_121329915_G_GCT',
    '7_121354115_G_T',
    '7_121378803_A_T',
    '7_121332237_C_CGT'
]

In the clumped credible set dataset, none of them is flagged as LD clumped:

extracted = (
    StudyLocus.from_parquet(session, "/Users/dsuveges/project_data/gentropy/credible_set/gwas_catalog_PICSed_curated_associations")
    .filter(
        f.col('variantId').isin(test_variants) & (f.col('studyId') == study_id)
    )
)

print(extracted.df.count())
extracted.df.select('variantId', 'qualityControls', f.size('ldSet').alias('ldSetSize')).show()

Giving:

+-------------------+--------------------+---------+
|          variantId|     qualityControls|ldSetSize|
+-------------------+--------------------+---------+
|7_121325410_G_GCACC|                  []|       80|
|    7_121326736_A_G|                  []|       83|
|    7_121280773_G_A|                  []|       93|
|    7_121292393_A_G|                  []|       65|
|    7_121308853_A_G|                  []|       62|
|   7_121287383_A_AC|                  []|       61|
|    7_121308406_G_C|[Palindrome allel...|       64|
|    7_121305274_T_C|                  []|       58|
|  7_121301780_C_CGT|                  []|       42|
|    7_121320217_G_C|[Palindrome allel...|       70|
|    7_121319081_C_T|                  []|       56|
|  7_121329915_G_GCT|                  []|       75|
|    7_121354115_G_T|                  []|       76|
|    7_121378803_A_T|[Palindrome allel...|      124|
|  7_121332237_C_CGT|                  []|       79|
+-------------------+--------------------+---------+

On the same dataset, if we call clump again, some of them are indeed flagged as clumped:

extracted = (
    StudyLocus.from_parquet(session, "/Users/dsuveges/project_data/gentropy/credible_set/gwas_catalog_PICSed_curated_associations")
    .filter(
        f.col('variantId').isin(test_variants) & (f.col('studyId') == study_id)
    ).clump()
)

print(extracted.df.count())
extracted.df.select('variantId', 'qualityControls', f.size('ldSet').alias('ldSetSize')).show()

+-------------------+-----------------------------------------------------------------------------------------------------+---------+
|variantId          |qualityControls                                                                                      |ldSetSize|
+-------------------+-----------------------------------------------------------------------------------------------------+---------+
|7_121301780_C_CGT  |[]                                                                                                   |42       |
|7_121280773_G_A    |[Explained by a more significant variant in high LD (clumped)]                                       |0        |
|7_121319081_C_T    |[]                                                                                                   |56       |
|7_121320217_G_C    |[Palindrome alleles - cannot harmonize]                                                              |70       |
|7_121292393_A_G    |[]                                                                                                   |65       |
|7_121308853_A_G    |[Explained by a more significant variant in high LD (clumped)]                                       |0        |
|7_121287383_A_AC   |[Explained by a more significant variant in high LD (clumped)]                                       |0        |
|7_121305274_T_C    |[Explained by a more significant variant in high LD (clumped)]                                       |0        |
|7_121308406_G_C    |[Palindrome alleles - cannot harmonize, Explained by a more significant variant in high LD (clumped)]|0        |
|7_121326736_A_G    |[]                                                                                                   |83       |
|7_121329915_G_GCT  |[Explained by a more significant variant in high LD (clumped)]                                       |0        |
|7_121325410_G_GCACC|[]                                                                                                   |80       |
|7_121354115_G_T    |[Explained by a more significant variant in high LD (clumped)]                                       |0        |
|7_121378803_A_T    |[Palindrome alleles - cannot harmonize]                                                              |124      |
|7_121332237_C_CGT  |[]                                                                                                   |79       |
+-------------------+-----------------------------------------------------------------------------------------------------+---------+

Apparently, after applying a filter, calling clump on an already clumped dataset flags additional loci as clumped. A second pass over clumped data should not find anything new, so this behaviour indicates an inconsistency in the clumping logic.
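To make the expected behaviour concrete, here is a minimal sketch of a greedy clumping pass in plain Python. It is an illustration under stated assumptions, not gentropy's implementation and not necessarily the logic adopted in the fix: `greedy_clump` is a hypothetical helper, significance is reduced to one -log10(p) value per lead, and a lead is flagged only if its own LD set contains a more significant lead that was kept.

```python
def greedy_clump(leads: dict[str, tuple[float, frozenset[str]]]) -> set[str]:
    """Return the IDs of leads that should be flagged as clumped.

    `leads` maps a lead variant ID to (-log10(p), IDs of its LD tag variants).
    """
    kept: set[str] = set()
    flagged: set[str] = set()
    # Visit leads from most to least significant; ties are broken by ID so
    # that the result does not depend on input order.
    ordered = sorted(leads.items(), key=lambda item: (-item[1][0], item[0]))
    for variant_id, (_neg_log_p, tags) in ordered:
        if tags & kept:
            # A more significant kept lead sits in this lead's LD set.
            flagged.add(variant_id)
        else:
            kept.add(variant_id)
    return flagged


# Toy data, not taken from the real dataset.
leads = {
    "lead_A": (120.0, frozenset({"lead_B", "lead_C"})),
    "lead_B": (30.0, frozenset({"lead_A"})),
    "lead_C": (12.0, frozenset({"lead_B"})),  # only tagged by a flagged lead, so kept
    "lead_D": (9.0, frozenset()),
}

flagged = greedy_clump(leads)
print(sorted(flagged))  # ['lead_B']

# Re-clumping only the surviving leads must not flag anything new.
survivors = {vid: val for vid, val in leads.items() if vid not in flagged}
print(sorted(greedy_clump(survivors)))  # []
```

With this kind of pass, the second call returning an empty set is the property that the dataset above violates: its surviving, unflagged leads still get flagged when clump is called again.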

DSuveges added the Data (Relates to Open Targets data team), Gentropy, and Genetics (Relates to Open Targets genetics team) labels Sep 17, 2024

d0choa added the Bug (Something isn't working) label Sep 17, 2024

DSuveges self-assigned this Sep 18, 2024

DSuveges linked a pull request Sep 19, 2024 that will close this issue

fix(ld clumping): a revised logic allows a more accurate clumping opentargets/gentropy#772

Merged

9 tasks

DSuveges closed this as completed in opentargets/gentropy#772 Sep 23, 2024
