Provide custom script for merging MetaPhlAn tables for better sample name handling #574

Open
alexhbnr opened this issue Jan 30, 2025 · 2 comments
Labels
enhancement Improvement for existing functionality

Comments

@alexhbnr
Contributor

Description of feature

Currently, the nf-core module mergemetaphlantables uses the script merge_metaphlan_tables.py that ships with MetaPhlAn. This script takes a number of MetaPhlAn profiles as input and merges them using the basic merge functionality of Python's pandas library.

Prior to merging, the script derives the sample name of each profile from its filename by removing the file extension and the string _profile: https://github.com/biobakery/MetaPhlAn/blob/b7e6670831f4842afdf3b0a8531a6f676ed56c45/metaphlan/utils/merge_metaphlan_tables.py#L36
taxprofiler names its MetaPhlAn profiles following the scheme <sample name>_<database name>.metaphlan_profile.txt, so each sample name in the merged table ends up as <sample name>_<database name>.metaphlan.
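To make the effect concrete, here is a minimal sketch of the name derivation as described above. It is not copied from the MetaPhlAn source; the function name, the sample name sampleA, and the database name mpa_db are made up for illustration.

```python
import os


def upstream_style_name(path):
    """Derive a sample name the way described above (illustrative sketch).

    Drops the directory, drops the last file extension, then removes
    the string "_profile".
    """
    stem = os.path.splitext(os.path.basename(path))[0]
    return stem.replace("_profile", "")


# Hypothetical taxprofiler-style filename: <sample name>_<database name>.metaphlan_profile.txt
print(upstream_style_name("results/sampleA_mpa_db.metaphlan_profile.txt"))
# prints: sampleA_mpa_db.metaphlan
```

Only the final extension (.txt) and _profile are removed, so the database name and .metaphlan stay in the column name.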

While the nf-core module mergemetaphlantables does merge the tables, I as a user have to manually edit the merged table and clean up the sample names if I don't want the database name and the suffix .metaphlan in them.

Therefore, I suggest either replacing the MetaPhlAn script merge_metaphlan_tables.py with a custom script that can handle the filename pattern introduced by nf-core/taxprofiler, or adding some code, e.g. sed, to remove the additional suffix.
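As a rough sketch of the second option, the header of the merged table could be cleaned in a small post-processing step. This is only an illustration, not a proposed implementation for the module: the function names are hypothetical, and it assumes that lines starting with # are comments and that the first other line is the tab-separated header with the clade column first. The database name is passed in explicitly because sample names may themselves contain underscores.

```python
def clean_sample_name(column, database):
    """Strip a trailing '.metaphlan' and then a trailing '_<database>'."""
    for suffix in (".metaphlan", "_" + database):
        if column.endswith(suffix):
            column = column[: -len(suffix)]
    return column


def clean_header(lines, database):
    """Yield the lines of a merged table with cleaned sample names.

    Assumption: '#'-prefixed lines are comments, and the first other
    line is the header. All remaining lines pass through unchanged.
    """
    header_seen = False
    for line in lines:
        if not header_seen and not line.startswith("#"):
            first, *samples = line.rstrip("\n").split("\t")
            cleaned = [clean_sample_name(s, database) for s in samples]
            line = "\t".join([first] + cleaned) + "\n"
            header_seen = True
        yield line


# Made-up merged table with two samples profiled against a database "mpa_db".
merged = [
    "#example comment line\n",
    "clade_name\tsampleA_mpa_db.metaphlan\tsampleB_mpa_db.metaphlan\n",
    "k__Bacteria\t100.0\t100.0\n",
]
print("".join(clean_header(merged, "mpa_db")), end="")
```

With the made-up input above, the header line becomes clade_name, sampleA, sampleB, and the comment and data lines are left untouched.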

@alexhbnr alexhbnr added the enhancement Improvement for existing functionality label Jan 30, 2025
@Midnighter
Collaborator

Have you tried enabling taxpasta standardisation for this purpose? Was that output more helpful to you?

@alexhbnr
Contributor Author

Yes, I have. But taxpasta currently seems to fail on MetaPhlAn output, at least for the tables I tried to run it on. It is related to the bug here: taxprofiler/taxpasta#140

merge_metaphlan_tables.py does a simple join on the clade names and ignores the NCBI tax ids. I currently prefer this kind of merging to the one discussed in the issue above, which would sum all taxa without a tax id into a new category called "unclassified".
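To illustrate what I mean by a join on clade names, here is a small standard-library sketch. It is not the pandas code from merge_metaphlan_tables.py; the function name and the abundance values are made up, and filling missing clades with 0.0 is an assumption of this sketch.

```python
def merge_by_clade(profiles):
    """Outer-join {sample: {clade_name: abundance}} on the clade name.

    Tax ids play no role here; a clade missing from a sample is
    filled with 0.0 (an assumption of this sketch).
    """
    samples = list(profiles)
    clades = sorted({clade for table in profiles.values() for clade in table})
    return {
        clade: [profiles[sample].get(clade, 0.0) for sample in samples]
        for clade in clades
    }


# Made-up profiles for two samples.
profiles = {
    "sampleA": {"k__Bacteria": 100.0},
    "sampleB": {"k__Bacteria": 60.0, "k__Archaea": 40.0},
}
merged = merge_by_clade(profiles)
for clade, values in merged.items():
    print(clade, values)
```

Every clade keeps its own row under its name, so nothing is collapsed into a shared "unclassified" category.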
