Skip to content

Adding Strandcheck and Strandflip-correction#375

Open
stschiff wants to merge 15 commits intomasterfrom
strandcheck
Open

Adding Strandcheck and Strandflip-correction#375
stschiff wants to merge 15 commits intomasterfrom
strandcheck

Conversation

@stschiff
Copy link
Copy Markdown
Member

@stschiff stschiff commented Mar 31, 2026

This is going to add an important feature. For now just work in progress, will update that message in due time. The branch is on top of v2.0.0.0, so commits within this PR will simplify once #369 is merged.

@nevrome nevrome mentioned this pull request Apr 1, 2026
@stschiff stschiff marked this pull request as ready for review April 30, 2026 10:07
@stschiff
Copy link
Copy Markdown
Member Author

stschiff commented Apr 30, 2026

OK, I open this for code review new. It's probably hard to review, sorry @nevrome. The important points:

  • I changed the machinery behind merging SNP sets with an emphasis on pure functions, with no logging. Any logging happens as a Left, which is then caught be the joinEntryPipe to report as error and resulting in skipping the SNP. Any silent decisions are not anymore logged. I found this far easier to reason about and to maintain, and it was really unnecessary to bother the user with logging of trivial cases like merging genetic positions or snpIds.
  • I changed the reasoning behind getConsensusSnpEntry, which now uses two different algorithms to determine the consensus SNPs for the two cases with or without strand flips. This is non-breaking for the default case of no strand flips.
  • I simplified the code around recodeAlleles, which has always been rather complicated because of inherent complexity of dealing with missing alleles. I think the code is now clearer and easier to maintain, and it should be non-breaking for the case of no strand-flips (unit and golden tests confirm).
  • the new option --strandCheck is only available in forge. I do not allow strand-flips in any other sub-command, as I think that is harder to explain then. If a user suspects strand flips (they will notice because the merged number of SNPs will be very low in case there are strand flips but trident assumes none), they should first forge (and thereby correct strand flips), and then run any additional command. At least I as a user would find it a little bit "too smart", if trident where able to correct strand flips somewhat under the hood implicitly every time it reads in multiple datasets.... but that's something that I'm happy to discuss... would be very easy to add the option also to other sub-commands that use getJointGenotypeData.

Finally, I have not battle-tested this new feature yet, and I am planning to reach out to users who have reported strand flips in the past. So while this is ready for review, I would like to delay merging until I've had a chance to see it in practice.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 30, 2026

Codecov Report

❌ Patch coverage is 59.77011% with 35 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.15%. Comparing base (edc8849) to head (469801a).

Files with missing lines Patch % Lines
src/Poseidon/Core/GenotypeData.hs 63.15% 18 Missing and 10 partials ⚠️
...Poseidon/CLI/Trident/OptparseApplicativeParsers.hs 0.00% 3 Missing ⚠️
src/Poseidon/CLI/Trident/Forge.hs 0.00% 1 Missing and 1 partial ⚠️
src/Poseidon/CLI/Xerxes/FStats.hs 0.00% 1 Missing ⚠️
src/Poseidon/Core/Package.hs 80.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #375      +/-   ##
==========================================
+ Coverage   51.80%   52.15%   +0.34%     
==========================================
  Files          36       36              
  Lines        5870     5881      +11     
  Branches      641      642       +1     
==========================================
+ Hits         3041     3067      +26     
+ Misses       2188     2172      -16     
- Partials      641      642       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Member

@nevrome nevrome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through the code. Looks very good, overall. My main concerns are with the interface, but maybe I just don't understand exactly what you wanted to achieve.

parseStrandCheck = OP.switch (
OP.long "strandCheck" <>
OP.help "Whether to allow strand flips in the genotype data. Note that this will remove \
\any A/T and G/C SNPs from the data, as for those we cannot determine the correct strand orientation.")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the interface in general, as a slightly oblivious user myself:
Does that mean the default behaviour of trident does not change, if this option is not active? Is this a good default? Should --strandCheck not be active by default?

You write

If a user suspects strand flips (they will notice because the merged number of SNPs will be very low in case there are strand flips but trident assumes none) ...

but I wonder if I myself would actually notice this. Should there not be a warning in these cases? Maybe a little report at the end of the forging operation, that reports the number of lost SNPs and makes suggestions like "use --strandCheck" under certain conditions?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the setting right now is such that the default behaviour does not change, and by default we i) assume that all SNPs live on the same strand, and ii) that we therefore do not have to check or flip anything.

There are arguments for setting a different default, I will raise this in a new comment below.

blocks <- liftIO $ catch (
runSafeT $ do
eigenstratProd <- getJointGenotypeData logA False relevantPackages Nothing
eigenstratProd <- getJointGenotypeData logA False False relevantPackages Nothing
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand now what you mean by

they should first forge (and thereby correct strand flips), and then run any additional command. At least I as a user would find it a little bit "too smart", if trident where able to correct strand flips somewhat under the hood implicitly every time it reads in multiple datasets...

I was wondering which subcommand you had in mind, completely forgetting about xerxes 🤦. Good that we have this code in here now, and can not delay this decision!

I'm not sure about it, honestly. But I feel xerxes should at least tell me that there is a potential issue if I run it on malformed data. And if I could run --strandCheck here, instead of tediously forging everything beforehand, I would probably appreciate the convenience. 🤔

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yes, I expected this opinion from you 😁... and I was going back and forth on this. OK, let's discuss this below more fully.

Comment thread src/Poseidon/CLI/Xerxes/FStats.hs
Comment thread src/Poseidon/Core/GenotypeData.hs
Comment thread src/Poseidon/Core/GenotypeData.hs Outdated
Comment thread src/Poseidon/Core/GenotypeData.hs Outdated
Comment thread test/testDat/testPackages/test_strandflips/POSEIDON.yml Outdated
Comment thread poseidon-hs.cabal Outdated
stschiff and others added 2 commits May 5, 2026 13:17
@stschiff
Copy link
Copy Markdown
Member Author

stschiff commented May 5, 2026

OK, so @nevrome raised three important points here:

Should --strandcheck actually be the new default behaviour?

So, in ADMIXTOOLS's mergeit it actually is indeed the default to perform strandchecks. This means that by default, C/G and A/T SNPs are kicked out when merging two packages.

But is this a good default behaviour for trident? Hmm, well, it certainly is the safer option! But I think it would be a poor default, for the simple reason that our philosophy is to support package archives, with lots of packages that are constantly viewed together. It is reasonable to assume that the default case for genotype-merges are happening within well-defined multi-package archives, in which strand-checks can be excluded by construction.

In any case, making it the default would be a massive breaking change: Any command that involves merging, either inside forge or any other joint-loading, would now return a largely reduced dataset with A/T and C/G SNPs removed. I don't think we can do this.

Should the strandcheck option be included in other commands that involve joint-genotype loading, such as inside xerxes?

I voted against it, but maybe it should indeed be allowed. I am open for that

If do not make strandcheck the default, should there be some detection?

Yes, I think that is a great idea! Note, though, that we already warn here through the joinEntryPipe, which reports errors that emerged in the merging step. And this line here actually produces an error if inconsistent SNPs are found. So we can consider two non-exclusive options:

  1. Elevate this warning to an actual error, stopping the merging. This may be a problem if you're trying to merge a dataset that just has a few really faulty SNP entries that you would prefer just to be skipped. I don't know whether that's a real use-case. We could then also offer a --skipErrors option for people who know what they are doing.
  2. We can add to the error message due to inconsistent alleles a line to possibly try --strandcheck, as a hint.

I think I like both of these option and propose to do them both now. I will test this with a real dataset that I know contains strand-flipped alleles. What do you think?

Co-authored-by: Copilot <copilot@github.com>
@nevrome
Copy link
Copy Markdown
Member

nevrome commented May 6, 2026

Ok - thanks for the explanation. I think what you already partially implemented in 46b17ac is the right way to go: Stop if incongruent SNPs are detected, and force the user to make a decision how to deal with them. That's the kind of hand-holding I appreciate.

If I understand correctly, then there may be issues that --strandcheck can not fix, right? So in certain situations one may want to call both --strandcheck and --skip-incongruent-snps? And this may then also apply for the xerxes subcommands?

I'm also wondering about the naming of these options:

  1. --skip-incongruent-snps is fine and clear, but all other trident CLI arguments use camel case. So maybe --skipIncongruentSNPs would be more consistent.
  2. --strandcheck is misleading, because it does not only check, but also filter. When thinking about a better name I was even wondering if it does not just conflates two separate tasks?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants