
Specify amount of memory (RAM) to be used #233

Open
Sabryr opened this issue May 1, 2021 · 10 comments

Sabryr commented May 1, 2021

Is it possible to specify the amount of memory (RAM) to be used, instead of the program automatically detecting the amount of RAM?

ruanjue (Owner) commented May 1, 2021

For given data and parameters, the memory usage is fixed. The program detects the total RAM, but it does not trade off RAM against runtime.

Sabryr (Author) commented May 1, 2021

Thank you for the answer. I am setting up wtdbg2 on our HPC cluster. Processing is submitted as a batch job, and each job must specify the resources it needs. For example:
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=256G

However, when the job is submitted, wtdbg2 detects all the resources on the compute node and plans accordingly:
-- total memory 3170070156.0 kB
-- available 2944877052.0 kB
-- 128 cores
(I found the -t option to limit the number of cores used, but, as you say, this is not possible for memory.)
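
For illustration, a minimal sketch of such a job script, with -t tied to the Slurm allocation (the wtdbg2 flags mirror the test command further down; the input path is a placeholder):

#!/bin/bash
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=256G

# Match the wtdbg2 thread count to the Slurm allocation instead of hard-coding it.
wtdbg2 -t "${SLURM_CPUS_PER_TASK}" -x rs -X 32 -g 32g -L 5000 \
    -i "${INPUT_FILE}" -fo axolotl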

A user has circumvented this by occupying a whole node with all its resources, which leads our monitoring scripts to report enormous resource wastage. I am trying to find a solution for this, as your program seems to be the only realistic option for his PacBio reads. I have tested with sample data and could not find a way to inform wtdbg2 of the job's resource limits.

In addition, wtdbg2 writes to disk very frequently. I see that this may be to avoid exceeding RAM limits. At the same time, on some of our nodes with about 3 TB of RAM, we would prefer that users do more work in RAM and access the disk less.

Could you help me set this up so I can help solve these limitations? I would gladly provide any assistance and also contribute the findings back.

ruanjue (Owner) commented May 2, 2021

Please ignore the messages about RAM and cores; the only option affected is -t 0, which means "use all cores". Otherwise, wtdbg2 runs the same regardless of how many resources you have. To keep wtdbg2 from writing too much information to your disk, you can add the option --minimal-output. During the development of wtdbg2, I tended to use more RAM to speed things up rather than the disk.

Sabryr (Author) commented May 2, 2021

Thank you very much. I will try this.

Sabryr (Author) commented May 2, 2021

This is the comparison of runs with and without --minimal-output.

wtdbg2 -t 8 -x rs -X 32 -g 32g -L 5000 -i ${INPUT_FILE} -fo axolotl

JobName  AllocCPUS  Time      MaxDiskWrite  AveDiskWrite  MaxRSS
wtdbg2   8          00:55:45  1557.05M      1557.05M      43671364K

wtdbg2 -t 8 --minimal-output -x rs -X 32 -g 32g -L 5000 -i ${INPUT_FILE} -fo axolotl

JobName  AllocCPUS  Time      MaxDiskWrite  AveDiskWrite  MaxRSS
wtdbg2   8          01:21:43  1358.22M      1358.22M      43668708K

--minimal-output makes the run about 30 minutes slower with everything else the same, and writes about 200 MB less to disk on average.
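
(For anyone reproducing these figures: the columns match Slurm's sacct accounting fields, with Elapsed displayed as Time above, so a query along these lines should print them; the job ID is a placeholder.)

sacct -j 12345 --format=JobName,AllocCPUS,Elapsed,MaxDiskWrite,AveDiskWrite,MaxRSS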

I forked your repo; do you have any recommendations on how I can test changing the disk write frequency?

ruanjue (Owner) commented May 3, 2021

Thanks for the information. With --minimal-output, wtdbg2 writes the core compressed results to disk only once.

Sabryr (Author) commented May 4, 2021

With --minimal-output, processing becomes slower, for reasons I do not understand, so it is not giving me the outcome I was expecting, which is to do more work in memory and write to disk at the end. So the question is: if I avoid --minimal-output, does wtdbg2 stop processing until a chunk is fully written to disk, or does processing continue while the data is being written? On my end, when I/O is high, CPU usage drops.
Here (https://github.com/ruanjue/wtdbg2/blob/b77c5657c8095412317e4a20fe3668f5bde6b1ac/filewriter.h) I see that you have implemented parallel writing, but do you have any idea about my observation above?
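
One way to check this from outside the program (a diagnostic sketch, assuming the sysstat tools are available on the node) is to sample the CPU and disk activity of the running process and watch whether CPU utilization dips during write bursts:

# Sample CPU (-u) and disk I/O (-d) of the newest wtdbg2 process every 5 seconds.
pidstat -u -d -p "$(pgrep -n wtdbg2)" 5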

ruanjue (Owner) commented May 4, 2021

Please have a look at the usage notes in wtdbg2 --help:

 --minimal-output
   Will generate as less output files (<--prefix>.*) as it can

Sabryr (Author) commented Jun 17, 2021

I was able to use your software to make optimal use of our HPC setup with a sample of axolotl data; thank you for that help. However, now that I am handling the real genome ("-x sq -X 80 -g 7.5g -L 5000", input size 1.7 TB), it is going to take about 80 days on a single node. So I was wondering whether wtdbg2 can use multiple nodes (MPI)?

ruanjue (Owner) commented Jun 18, 2021

Try -x rs -X 50 -g 7.5g for a huge genome.
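
Combined with the command line used earlier in the thread, that would look something like the sketch below; the thread count, input path, and output prefix are placeholders to adapt:

wtdbg2 -t "${SLURM_CPUS_PER_TASK}" -x rs -X 50 -g 7.5g -L 5000 \
    -i "${INPUT_FILE}" -fo real_genome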
