
increase memory of quast #1073

Open · wants to merge 9 commits into base: master

Conversation

@paulzierep (Contributor)

Some quast jobs take forever; I assume this is due to the low memory allowance (12 GB so far: https://github.com/galaxyproject/tpv-shared-database/blob/3be0403ffc960effd180c65fa0e2242dfe5e6aa9/tools.yml#L2121C1-L2123C12). Ideally, though, I would like to work on a solution similar to #881, if an admin can run the query for me.
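For reference, the straightforward fix would just raise the flat allocation in the shared database, along these lines (a minimal sketch; the tool-id pattern follows the style of the linked tools.yml, and the new value is an assumption, since finding the right value is the point of this PR):

tools:
  toolshed.g2.bx.psu.edu/repos/iuc/quast/quast/.*:
    mem: 24  # hypothetical bump from the current 12 GB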

@mira-miracoli (Contributor) commented Jan 16, 2024

I hope this makes sense and helps (my last statistics course was 6 years ago):

(venv) stats@sn06:~/mira$ cat quast  |tail -n +2| awk '{print$15}'| grep -o '[0-9]*'| histogram.py --percentage --max 265                                                                    
# NumSamples = 34055; Min = 0.00; Max = 265.00
# 7136 values outside of min/max
# Mean = 127811.4829246806636323594204; Variance = 102472553574407.7959718309167; SD = 10122872.79256278; Median 34                                                                          
# each ∎ represents a count of 195
    0.0000 -    26.5000 [ 14676]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ 43.09                                                                          
   26.5000 -    53.0000 [  8138]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ 23.90
   53.0000 -    79.5000 [  1425]: ∎∎∎∎∎∎∎ 4.18
   79.5000 -   106.0000 [   753]: ∎∎∎ 2.21
  106.0000 -   132.5000 [   834]: ∎∎∎∎ 2.45
  132.5000 -   159.0000 [   275]: ∎ 0.81
  159.0000 -   185.5000 [   238]: ∎ 0.70
  185.5000 -   212.0000 [   246]: ∎ 0.72
  212.0000 -   238.5000 [   159]:  0.47
  238.5000 -   265.0000 [   175]:  0.51

using https://github.com/mira-miracoli/data_hacks/blob/patch-1/data_hacks/histogram.py

@paulzierep (Contributor, Author)

I don't find the time to properly set up the rules at the moment; can we merge this PR for now? @bgruening

@bgruening (Member)

Can you please try that: https://github.com/usegalaxy-eu/infrastructure-playbook/pull/1073/files#diff-ff91c17e82694a84945958b09ddc38e4535d1f99ee1fb0ed594a8cd4fecceca7R733

Not very smart, but better than allocating too much memory to every run.
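A sketch of the kind of input-size rule this points at (assuming TPV's input_size variable, in GB, is available in the rule expression; the threshold and memory value are illustrative, not measured):

rules:
  - id: quast_large_input
    if: input_size >= 5  # GB, illustrative threshold
    mem: 40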

@paulzierep (Contributor, Author)

> Can you please try that: https://github.com/usegalaxy-eu/infrastructure-playbook/pull/1073/files#diff-ff91c17e82694a84945958b09ddc38e4535d1f99ee1fb0ed594a8cd4fecceca7R733
>
> Not very smart, but better than allocating too much memory to every run.

You mean to couple the memory to the input size? I think the problem is that the deciding factor is mainly the content of the bacterial community (i.e. many species lead to high RAM usage, few species to low RAM usage) ...
The best option I see at the moment is to increase the memory for quast only if the meta option is used. Is it possible to make rules based on the tool parameters?
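(It turns out this is possible via TPV's helpers.job_args_match; the diff in this PR, quoted in the review comment below, does essentially this. A sketch, where the parameter path is an assumption about the QUAST wrapper:)

rules:
  - id: metagenome
    if: helpers.job_args_match(job, app, {'assembly': {'type': 'Metagenome'}})
    cores: 20
    mem: 80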

@paulzierep (Contributor, Author)

The problem with this tool is that memory usage is only weakly related to the input size, as shown in the stats figure.

[figure: input_vs_memory_quast, scatter plot of input size vs. memory usage]

Since all jobs that reported issues were related to metagenomic analysis, this simple approach of increasing the memory only for those jobs should work. Another input option that should be considered is co-assembly vs. non-co-assembly.
Is it possible to query specific tool parameters from the DB? Maybe, if an admin could help me out with this one, I could investigate whether the rules can be made more fine-grained for different parameters instead of inputs.

Another approach in the long run could be to add inputs to the tool that help to allocate memory, i.e. run Kraken first (or count the number of contigs) and then, based on these metrics, decide how much memory the job should get.

@paulzierep (Contributor, Author)

Can an admin check what I did wrong? I used https://github.com/galaxyproject/tpv-shared-database/blob/efd5b95033bb59fa66d2d5f0d0c43edce2a1c24b/tools.yml#L438 as a template.

@sanjaysrikakulam (Member)

> The problem with this tool is that memory usage is only weakly related to the input size, as shown in the stats figure.
>
> [figure: input_vs_memory_quast, scatter plot of input size vs. memory usage]
>
> Since all jobs that reported issues were related to metagenomic analysis, this simple approach of increasing the memory only for those jobs should work. Another input option that should be considered is co-assembly vs. non-co-assembly. Is it possible to query specific tool parameters from the DB? Maybe, if an admin could help me out with this one, I could investigate whether the rules can be made more fine-grained for different parameters instead of inputs.
>
> Another approach in the long run could be to add inputs to the tool that help to allocate memory, i.e. run Kraken first (or count the number of contigs) and then, based on these metrics, decide how much memory the job should get.

At an initial glance through the DB, I could not find a table that contains the job/tool parameters individually or explicitly. However, the job table contains every job's command_line. You can extract this data for your tool of interest and then look into it (it might be a tedious process; I will try to dig through the codebase to see whether the job parameters are stored explicitly anywhere).

SQL query:

select id, command_line from job where tool_id ilike '%toolshed.g2.bx.psu.edu/repos/iuc/quast/quast/%';

You can add additional conditions, for example, to look for jobs where type metagenome was used (based on my understanding of the tool's source: if someone uses the type metagenome, metaquast is used; otherwise, quast).

SQL query:

select id, command_line from job where tool_id ilike '%toolshed.g2.bx.psu.edu/repos/iuc/quast/quast/%' and command_line ilike '%meta%';

@sanjaysrikakulam (Member) commented Feb 1, 2024

OK, I found the table job_parameter, through which you can get the parameters for every job:

select * from job_parameter where job_id=<enter the job id here>;

I mapped the job ID to a different column of the job_parameter table earlier and got empty results (hence I didn't share the info about the table at first), but I just realized my mistake, so here is a solution.

Also, you can join the previously posted SQL query with this one:

select jp.* from job_parameter jp inner join job j on jp.job_id = j.id where j.tool_id ilike '%toolshed.g2.bx.psu.edu/repos/iuc/quast/quast/%';

Comment on lines 738 to 741
- id: metagenome
if: helpers.job_args_match(job, app, {'assembly': {'type': 'Metagenome'}})
cores: 20
mem: 80
@mira-miracoli (Contributor) commented Feb 2, 2024

Could it be assembly.type?
(I just briefly had a look at the wrapper; I could also be wrong.)
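For reference, the two candidate shapes (which one matches depends on how helpers.job_args_match compares against the job's parameter dict; the capitalization of the value also needs checking against the wrapper):

# nested form, as in the diff:
if: helpers.job_args_match(job, app, {'assembly': {'type': 'Metagenome'}})
# flattened form, if the parameter is keyed as assembly.type:
if: helpers.job_args_match(job, app, {'assembly.type': 'Metagenome'})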
