Wish: extract parts of gzip file #2

@ole-tange

My users often have huge .gz files that they would like to process in parallel.

Can gzrt be adapted so it can extract a valid gz-file in blocks?

Let us assume I have a 1 GB file.gz and I want to extract blocks of around 1 MB of compressed data. I want to do this in parallel. So first I want to identify positions where a valid gz-block starts:

$ gzrt --next-start-of-block 0
0
$ gzrt --next-start-of-block 1000000
1234888
$ gzrt --next-start-of-block 2000000
2123488
...
$ gzrt --next-start-of-block 999000000
999348877

The idea is to seek to the given byte position and scan forward for the next valid gz-block. Once it is found, print its byte position and exit.
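To make the seek-and-scan idea concrete, here is a minimal Python sketch. It is not gzrt's implementation, and it swaps in a simpler technique: it looks for the start of the next gzip member (the `1f 8b 08` header from RFC 1952), which only helps for files made of several concatenated members. Deflate blocks inside a single member can start at any bit position and may refer back to the preceding 32 KiB of output, so finding those would need considerably more work. The function name `next_member_start` and the `probe` size are hypothetical.

```python
import gzip
import zlib
from typing import Optional

GZIP_MAGIC = b"\x1f\x8b\x08"  # ID1, ID2, CM=deflate (RFC 1952)


def next_member_start(data: bytes, offset: int, probe: int = 65536) -> Optional[int]:
    """Return the first position >= offset where a gzip member appears to start.

    Heuristic: find the magic bytes, then try to inflate a short probe.
    A zlib.error means the match was a false positive, so keep scanning.
    """
    pos = data.find(GZIP_MAGIC, offset)
    while pos != -1:
        inflater = zlib.decompressobj(wbits=31)  # 31 = expect a gzip header
        try:
            inflater.decompress(data[pos:pos + probe])
            return pos
        except zlib.error:
            pos = data.find(GZIP_MAGIC, pos + 1)
    return None


# Two concatenated members stand in for a large multi-member file.gz.
first = gzip.compress(b"a" * 1000, mtime=0)
second = gzip.compress(b"b" * 1000, mtime=0)
blob = first + second

print(next_member_start(blob, 0))
print(next_member_start(blob, 1) == len(first))
```

A truncated probe does not raise an error, so a match very close to the end of the data can still be a false positive; a real tool would want a stricter check.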

After identifying where blocks start I would then be able to extract from one block to another:

gzrt --from-byte 0 --to-byte 1234888 | my_program &
gzrt --from-byte 1234888 --to-byte 2123488 | my_program &
gzrt --from-byte 2123488 --to-byte 3212348 | my_program &
...
gzrt --from-byte 998374753 --to-byte 999348877 | my_program &
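The range extraction can be sketched in Python under the same simplifying assumption that every start offset is the beginning of a complete gzip member, so each byte range decompresses on its own. The helper `extract_ranges` is hypothetical and only illustrates the from-byte/to-byte idea; it is not how gzrt works.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor
from typing import List


def extract_ranges(data: bytes, starts: List[int]) -> List[bytes]:
    """Decompress each [starts[i], starts[i+1]) range concurrently.

    Assumes every offset in `starts` begins a complete gzip member.
    The last range runs to the end of the data.
    """
    bounds = list(zip(starts, starts[1:] + [len(data)]))
    with ThreadPoolExecutor() as pool:
        # pool.map returns results in the same order as the input ranges.
        return list(pool.map(lambda b: gzip.decompress(data[b[0]:b[1]]), bounds))


# Three small members stand in for the blocks of a large file.gz.
chunks = [b"alpha\n" * 100, b"beta\n" * 100, b"gamma\n" * 100]
members = [gzip.compress(c, mtime=0) for c in chunks]
blob = b"".join(members)
starts = [0, len(members[0]), len(members[0]) + len(members[1])]

parts = extract_ranges(blob, starts)
print([len(p) for p in parts])
```

Threads are used here because zlib releases the GIL while inflating; for a real 1 GB file, separate processes reading their own byte range from disk would be closer to the shell pipeline above.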
