Memory limit exceeded during vg autoindex for GCSA/LCP indexing #4404
Comments
What do you have in the VCF file? Usually when GCSA construction runs out of memory, it happens because the graph is too complex, there are too many variants in repetitive regions, or there are too many nondeterministic locations (where the first/last bases of reference and alternative nodes are identical).
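For example, a quick way to answer that question (a generic sketch, assuming bcftools is installed; mat.vcf.gz stands in for the VCF in question):

    # Print the SN (summary numbers) lines: record, SNP, indel, and
    # multiallelic-site counts. A heavy indel/multiallelic load in repeats
    # is a common cause of GCSA blowup.
    bcftools stats mat.vcf.gz | grep "^SN"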
I used the command:
I'm not very familiar with MUMmer, but I think the problem is that you have too much duplicated sequence in the graph. GCSA construction does not like that, because it can't collapse identical k-mers if they start from different positions in the graph. If you want to build a graph based on two aligned haplotypes, Minigraph-Cactus should be a better choice. You can then map reads using Giraffe, which is faster than vg map.
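As a rough illustration of that route (a sketch only: the seqfile, sample name, and read file are assumptions, so check the Minigraph-Cactus documentation for the exact options in your version):

    # Build a pangenome graph from the haplotype assemblies listed in
    # seqfile.txt, emitting Giraffe indexes alongside it.
    cactus-pangenome ./jobstore seqfile.txt --outDir graphs --outName sample \
        --reference hap1 --giraffe
    # Map reads against the resulting graph with Giraffe.
    vg giraffe -Z graphs/sample.gbz -f reads.fq.gz > mapped.gam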
Thank you so much for your advice! I will definitely try using Minigraph-Cactus to build the graph. I appreciate your help and will let you know how it goes after testing. Thanks again!
Dear developer: there is still enough storage, but the task automatically terminates after running for one day. How can I resolve this? Only sample.trans.spliced.gcsa and sample.trans.spliced.gcsa.lcp were not produced; the other files are fine. Best. Code:
The error-robustness bits of the indexing code only go so far. I would advise users to fix the disk availability issues rather than dig into the internals of the index construction.
Dear developers: Code: Error log:
I'm not able to see how the GCSA code could read a kmer length of 0 unless one was actually written to the file. We don't have error checking at the close call for when we write these files: Line 351 in d6ea214
And I suppose it's possible for close() to fail and leave a bunch of 0 bytes visible in the file. But I can't see why that would happen unless the filesystem was almost exactly full and we couldn't successfully flush the very end of the file.
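For what it's worth, a generic way to check a leftover temp file for that pattern (a sketch; the path is a placeholder, and the binary layout is GCSA2-internal):

    # Dump the first 32 bytes: if the header region is all zeros, the kmer
    # length read from it will be 0.
    xxd -l 32 /path/to/tmp/kmer_file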
@jltsiren do you know how the GCSA library might be getting convinced to write a kmer length of 0 into its temp files? Maybe if it ends up processing an empty subgraph somehow? @ld9866 What do you get for df -T on the temp directory? If your particular distributed filesystem doesn't actually guarantee that written data is visible immediately after a successful close(), that could explain zeroes turning up in these files.
This might help with problems like #4404 (comment) by letting us tell whether the kmer files were actually accepted by the filesystem or not.
@adamnovak The error is from the GCSA construction code. Each input file consists of one or more sections, and each section starts with a header that defines the kmer length. The error seems to occur in the code that reads those section headers.
Dear developer: what can I do now to solve this problem? Best
    df -T /home/lidong/Data/10.pantrans/TMP
    Filesystem     Type    1K-blocks    Used    Available    Use%    Mounted on
@jltsiren I guess these are really vg's temp files and not properly GCSA's. @ld9866 You have an approximately three-hundred-terabyte single block device? Can you describe that storage setup in more detail to convincingly rule it out as the source of the problem? How confident are you in the quality of its implementation? If the storage is working, and vg really is writing 0-size kmers into the GCSA input files, then someone has to go through vg's kmer-writing code and work out how that can happen. Are you able to share the input files you are using? As a workaround, you might be able to generate the indexes you need with a series of individual vg commands instead of using the vg autoindex wrapper.
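A minimal sketch of that manual route (standard vg subcommands with placeholder file names; the vg wiki's Index Construction page is the authoritative recipe):

    vg construct -r ref.fa -v vars.vcf.gz > graph.vg   # build the graph
    vg index -x graph.xg graph.vg                      # XG index
    vg prune -r graph.vg > graph.pruned.vg             # drop complex regions
    vg index -g graph.gcsa graph.pruned.vg             # GCSA + LCP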
Dear developer: I have tested the process on a single extracted chromosome and the construction works, so I guess the failure may be caused by too much variant information. Do you have a good way to keep only the effective variant information? Or is there a way to build an index for a very large set of variants? Best
Yes, there is some documentation about how to make the spliced pangenome graph with individual vg commands. Manual GCSA construction is documented here: https://github.com/vgteam/vg/wiki/Index-Construction#complex-graph
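The complex-graph section of that page amounts to something like the following (a sketch; it assumes a GBWT of the haplotypes already exists, and the option spellings should be checked against your vg version):

    # Unfold haplotypes while pruning, keeping a node mapping so GCSA
    # results can be translated back to the original node IDs.
    vg prune -u -g graph.gbwt -m node_mapping graph.vg > graph.pruned.vg
    vg index -g graph.gcsa -f node_mapping graph.pruned.vg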
Hello,
I am encountering an issue when running vg autoindex to construct a graph from an HG002 reference FASTA and VCF file. The command I am using is as follows:
vg autoindex --workflow map --threads 24 --prefix /public1/home/sc30852/HG002/vg/graph --ref-fasta ../../hg002.mat.fasta --vcf ../mat.vcf.gz
Here is part of the log output:
[IndexRegistry]: Checking for phasing in VCF(s).
[IndexRegistry]: Chunking inputs for parallelism.
[IndexRegistry]: Chunking FASTA(s).
[IndexRegistry]: Chunking VCF(s).
[IndexRegistry]: Constructing VG graph from FASTA and VCF input.
[IndexRegistry]: Constructing XG graph from VG graph.
[IndexRegistry]: Pruning complex regions of VG to prepare for GCSA indexing.
[IndexRegistry]: Constructing GCSA/LCP indexes.
PathGraphBuilder::write(): Memory use of file 5 of kmer paths (503.81 GB) exceeds memory limit (503.781 GB).
It seems like the memory consumption during the GCSA indexing step exceeds the available memory (around 504 GB). Do you have any suggestions on how I can reduce memory usage, or is there a way to chunk the input differently to avoid this issue?
Any help would be appreciated!
Thank you!
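A commonly suggested mitigation for this error is to prune more aggressively before GCSA construction rather than raise the memory budget, since it is the per-file memory use of the kmer paths that trips the limit here. A sketch with placeholder names and values (check vg prune --help and vg index --help for your version):

    vg prune -r -M 32 graph.vg > graph.pruned.vg      # cap node degree at 32
    vg index -g graph.gcsa -Z 2000 graph.pruned.vg    # cap temp disk at 2 TB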