Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVRO-4032: [Rust] Make it possible to set custom compression level for all codec that support it #3095

Merged
merged 12 commits into from
Aug 15, 2024

Conversation

MarcoLugo
Copy link
Contributor

@MarcoLugo MarcoLugo commented Aug 13, 2024

AVRO-4032

What is the purpose of the change

This PR adds support for picking a custom compression level for the Zstandard codec. This change does not remove the Zstandard codec variant in order to keep backwards compatibility and a slightly simpler interface for the default case.

Verifying this change

Added new tests:

  • Compression and decompression, analogous to other codec tests.
  • A test that makes sure that the codec is converted to a string as zstandard (i.e. same as Zstandard codec) as this is still the same codec.

Documentation

  • Does this pull request introduce a new feature?
    Yes.
  • If yes, how is the feature documented?
    Simple description available in cargo-generated documentation. Validated it using cargo doc --open --no-deps --all-features.

@github-actions github-actions bot added the Rust label Aug 13, 2024
Comment on lines 52 to 58
/// The `Zstandard` codec uses Facebook's [Zstandard](https://facebook.github.io/zstd/) with the
/// default compression level.
Zstandard,
#[cfg(feature = "zstandard")]
/// This codec is the same as `Zstandard` but allows specifying the compression level.
#[strum(disabled)]
ZstandardWithLevel(ZstandardLevel),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are in 0.x state, I don't think we need to avoid breaking changes. Directly changing Zstandard looks cleaner.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree!
I'd suggest to go even further:

  1. introduce ZstandardOptions (or ZstandardConfig) struct that provides all possible settings - the level and a dictionary. With a Default impl that uses 0 and &[]
  2. the same for all other codecs with their respective settings

WDYT ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With 71021f2 I've pushed codec settings for Bzip2. Please pull before working on the settings for Zstd!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent idea! I'll go ahead with it for zstd.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also pushed similar settings for Xz!

Snappy does not provide APIs for setting compression level (or anything other), so there is no need of changes!

Deflate provides its own EncodeOptions but we cannot use it in our Codec enum because it does not implement several traits (Clone, Copy, Eq and PartialEq) ... Constructing EncodeOptions from its fields is not possible because it does not provide such APIs.

So, only Zstd is left! I'll leave it to you @MarcoLugo !

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a similar struct for zstd but for now I ended up leaving out the dictionary as it would be required for decoding as well and so it would need to go into the header metadata and I'm not sure how to do that at the moment. I'll need to read the avro spec in more detail, maybe this could be a future PR in order to avoid blocking this one?
I also made the new settings structs public so they're visible in the docs and can be used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dictionary is a Zstd thingy. It just needs to be the same for both encoding and decoding.
Just pushed 8f4a2f3 for it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am going to revert the support for dictionary. Dealing with the slice causes a lot of lifetime headaches.
I am about to commit an improvement to the (de)ser of the codec in the Avro binary metadata. We need to store the settings along with the codec name!

Copy link
Member

@martin-g martin-g Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I hit a problem!

By Avro spec when writing the binary data (.avro) the codec is stored in the header's metadata as a string, e.g. "zstandard". But there is nothing about codec's metadata!
So, if you do something like :

let schema = Schema::parse_str(
        r#"
        {
            "type": "record",
            "name": "Test",
            "fields": [
                {"name": "f1", "type": "int"},
                {"name": "f2", "type": "string"}
            ]
        }"#,
    )?;

    let mut writer = Writer::with_codec(&schema, Vec::new(), codec);
    let mut record = Record::new(writer.schema()).unwrap();
    record.put("f1", 27_i32);
    record.put("f2", "foo");
    writer.append(record)?;
    let input = writer.into_inner()?;
    let mut reader = Reader::new(&input[..])?;
    ...   

Then the Reader will not know anything about the codec's settings used by the writer! The Reader will use the codec with its default settings!
We can store the settings in the metadata's user_metadata, see reader.rs > Block struct. But this will take some more time!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I was fearing earlier regarding the dictionary though I am admittedly less familiar with the spec. I'd be in favor of having the zstd dictionary as a separate ticket/PR. This ticket initially it was about being able to set the level which I think is already a good improvement and we got the ability to set the level for more codecs than zstd too :)

@martin-g
Copy link
Member

martin-g commented Aug 15, 2024

I just noticed that Codec::iter() (generated by strum) returns only Null and Deflate and I am not sure why the others are not returned...
It behaves the same way in main branch, so it is not a regression.

Update: my fault! I needed to enable the features ...

cargo run --example generate_interop_data --all-features
   Compiling apache-avro v0.18.0 (/home/martin/git/apache/avro/lang/rust/avro)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.75s
     Running `target/debug/examples/generate_interop_data`
Wrote ../../build/interop/data/rust.avro
Wrote ../../build/interop/data/rust_deflate.avro
Wrote ../../build/interop/data/rust_snappy.avro
Wrote ../../build/interop/data/rust_zstandard.avro
Wrote ../../build/interop/data/rust_bzip2.avro
Wrote ../../build/interop/data/rust_xz.avro

All looks good!

@martin-g
Copy link
Member

I think I am ready!
@MarcoLugo Please test with your application and let me know how it goes!

use bzip2::Compression;

#[derive(Clone, Copy, Eq, PartialEq, Debug)]
pub struct Bzip2Settings {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder: do we need **Settings for each codec ?
All 3 classes are the same now.
If we need to add custom codec specific settings later it should not be a big deal - the other codecs will just ignore any setting they don't support. I do not expect many new settings.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work because the codecs use different compression level by default and if more than one feature is enabled then it is not possible to decide what to use as a default.

Signed-off-by: Martin Tzvetanov Grigorov <[email protected]>
Signed-off-by: Martin Tzvetanov Grigorov <[email protected]>
This way there is no need to enable other codecs when running unrelated tests.
Otherwise the build fails with:
```
error: function `avro_4032_codec_settings` is never used
  --> avro/tests/codecs.rs:50:4
   |
50 | fn avro_4032_codec_settings(codec: Codec) -> TestResult {
   |    ^^^^^^^^^^^^^^^^^^^^^^^^
```

Signed-off-by: Martin Tzvetanov Grigorov <[email protected]>
@martin-g martin-g changed the title AVRO-4032: [Rust] zstd level AVRO-4032: [Rust] Make it possible to set custom compression level for all codec that support it Aug 15, 2024
@MarcoLugo
Copy link
Contributor Author

I think I am ready! @MarcoLugo Please test with your application and let me know how it goes!

It looks good to me, thank you @martin-g ! It worked well when I used it in my application.

@martin-g martin-g merged commit 9379fdb into apache:main Aug 15, 2024
15 checks passed
@martin-g
Copy link
Member

Thank you too, @MarcoLugo !

martin-g pushed a commit that referenced this pull request Aug 15, 2024
…r all codec that support it (#3095)

* AVRO-4032: add zstd variant that allows for level config

* add docs

* manual impl for IntoStaticStr to handle 2 variants with same value

* AVRO-4032: [Rust] Add codec settings for Bzip2

Signed-off-by: Martin Tzvetanov Grigorov <[email protected]>

* AVRO-4032: [Rust] Add settings for Xz codec

Signed-off-by: Martin Tzvetanov Grigorov <[email protected]>

* add zstd settings

* make our new settings structs visible

* AVRO-4032: [Rust] Add support for Zstandard dictionary

Signed-off-by: Martin Tzvetanov Grigorov <[email protected]>

* AVRO-4032: [Rust] Store the codec compression_level in the header metadata

Signed-off-by: Martin Tzvetanov Grigorov <[email protected]>

* AVRO-4032: [Rust] Minor cleanups

Signed-off-by: Martin Tzvetanov Grigorov <[email protected]>

* AVRO-4032: Fix the build

Signed-off-by: Martin Tzvetanov Grigorov <[email protected]>

* AVRO-4032: Add tests for Codec::Null and ::Deflate

This way there is no need to enable other codecs when running unrelated tests.
Otherwise the build fails with:
```
error: function `avro_4032_codec_settings` is never used
  --> avro/tests/codecs.rs:50:4
   |
50 | fn avro_4032_codec_settings(codec: Codec) -> TestResult {
   |    ^^^^^^^^^^^^^^^^^^^^^^^^
```

Signed-off-by: Martin Tzvetanov Grigorov <[email protected]>

---------

Signed-off-by: Martin Tzvetanov Grigorov <[email protected]>
Co-authored-by: Martin Tzvetanov Grigorov <[email protected]>
(cherry picked from commit 9379fdb)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants