Description
My users often have huge .gz files that they would like to process in parallel.
Can gzrt be adapted so it can extract a valid gz-file in blocks?
Let us assume I have a 1 GB file.gz and I want to extract blocks of around 1 MB of compressed data. I want to do this in parallel. So first I want to identify positions where a valid gz-block starts:
$ gzrt --next-start-of-block 0
0
$ gzrt --next-start-of-block 1000000
1234888
$ gzrt --next-start-of-block 2000000
2123488
...
$ gzrt --next-start-of-block 999000000
999348877
The idea is to seek to the given byte position and then find the next valid gz-block. Once it is found, print its byte position and exit.
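To make the "seek, then find the next valid start" idea concrete, here is a rough Python sketch of a simpler variant. It is not how gzrt works, and it rests on a strong assumption: the file is a concatenation of several gzip members, so that each restart point begins with the gzip magic bytes. A single-member .gz file has only one such header (at offset 0), and locating raw deflate block boundaries inside it is a bit-level problem that this sketch does not attempt. The function name `next_member_start` and the `probe` parameter are hypothetical.

```python
import zlib

GZIP_MAGIC = b"\x1f\x8b\x08"  # ID1, ID2, CM=deflate, as defined in RFC 1952


def next_member_start(data: bytes, offset: int, probe: int = 65536):
    """Return the first position >= offset where a gzip member appears to start.

    Assumption: `data` holds concatenated gzip members. Returns None if no
    plausible member header is found.
    """
    pos = data.find(GZIP_MAGIC, offset)
    while pos != -1:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        d = zlib.decompressobj(wbits=31)
        try:
            # Decode a small window; a false positive (magic bytes that occur
            # by chance inside compressed data) will normally raise zlib.error.
            d.decompress(data[pos:pos + probe])
            return pos
        except zlib.error:
            pos = data.find(GZIP_MAGIC, pos + 1)
    return None
```

The validation step only makes a false match unlikely, not impossible, so a real tool would probably want to decode further before trusting a candidate position.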
After identifying where blocks start I would then be able to extract from one block to another:
gzrt --from-byte 0 --to-byte 1234888 | my_program &
gzrt --from-byte 1234888 --to-byte 2123488 | my_program &
gzrt --from-byte 2123488 --to-byte 3212348 | my_program &
...
gzrt --from-byte 998374753 --to-byte 999348877 | my_program &
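For illustration, the from/to pairs above could be derived from the list of block starts with something like the following Python sketch. The helper name `byte_ranges` is hypothetical, and it assumes that `--to-byte` would be exclusive and that the final range should run to the end of the file; the request above does not specify either.

```python
def byte_ranges(starts, file_size):
    """Pair consecutive block starts into (from_byte, to_byte) ranges.

    Assumptions: to_byte is exclusive, and the last range ends at file_size.
    """
    bounds = sorted(set(starts))
    if not bounds or bounds[-1] != file_size:
        bounds.append(file_size)
    # Each range begins where the previous one ends, so no byte is skipped
    # or processed twice.
    return list(zip(bounds, bounds[1:]))


# Offsets taken from the example above; the file size here is made up.
for start, end in byte_ranges([0, 1234888, 2123488], 3212348):
    print(start, end)
```

Each pair could then be handed to one worker process.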