-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AVRO-4032: [Rust] Make it possible to set custom compression level for all codec that support it #3095
Conversation
lang/rust/avro/src/codec.rs
Outdated
/// The `Zstandard` codec uses Facebook's [Zstandard](https://facebook.github.io/zstd/) with the | ||
/// default compression level. | ||
Zstandard, | ||
#[cfg(feature = "zstandard")] | ||
/// This codec is the same as `Zstandard` but allows specifying the compression level. | ||
#[strum(disabled)] | ||
ZstandardWithLevel(ZstandardLevel), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are in 0.x
state, I don't think we need to avoid breaking changes. Directly changing Zstandard
looks cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree!
I'd suggest to go even further:
- introduce ZstandardOptions (or ZstandardConfig) struct that provides all possible settings - the level and a dictionary. With a
Default
impl that uses0
and&[]
- the same for all other codecs with their respective settings
WDYT ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With 71021f2 I've pushed codec settings for Bzip2. Please pull before working on the settings for Zstd!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent idea! I'll go ahead with it for zstd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also pushed similar settings for Xz!
Snappy does not provide APIs for setting compression level (or anything other), so there is no need of changes!
Deflate provides its own EncodeOptions
but we cannot use it in our Codec
enum because it does not implement several traits (Clone, Copy, Eq and PartialEq) ... Constructing EncodeOptions from its fields is not possible because it does not provide such APIs.
So, only Zstd is left! I'll leave it to you @MarcoLugo !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a similar struct for zstd but for now I ended up leaving out the dictionary
as it would be required for decoding as well and so it would need to go into the header metadata and I'm not sure how to do that at the moment. I'll need to read the avro spec in more detail, maybe this could be a future PR in order to avoid blocking this one?
I also made the new settings structs public so they're visible in the docs and can be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dictionary is a Zstd thingy. It just needs to be the same for both encoding and decoding.
Just pushed 8f4a2f3 for it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am going to revert the support for dictionary. Dealing with the slice causes a lot of lifetime headaches.
I am about to commit an improvement to the (de)ser of the codec in the Avro binary metadata. We need to store the settings along with the codec name!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I hit a problem!
By Avro spec when writing the binary data (.avro) the codec is stored in the header's metadata as a string, e.g. "zstandard". But there is nothing about codec's metadata!
So, if you do something like :
let schema = Schema::parse_str(
r#"
{
"type": "record",
"name": "Test",
"fields": [
{"name": "f1", "type": "int"},
{"name": "f2", "type": "string"}
]
}"#,
)?;
let mut writer = Writer::with_codec(&schema, Vec::new(), codec);
let mut record = Record::new(writer.schema()).unwrap();
record.put("f1", 27_i32);
record.put("f2", "foo");
writer.append(record)?;
let input = writer.into_inner()?;
let mut reader = Reader::new(&input[..])?;
...
Then the Reader will not know anything about the codec's settings used by the writer! The Reader will use the codec with its default settings!
We can store the settings in the metadata's user_metadata
, see reader.rs > Block struct. But this will take some more time!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I was fearing earlier regarding the dictionary though I am admittedly less familiar with the spec. I'd be in favor of having the zstd dictionary as a separate ticket/PR. This ticket initially it was about being able to set the level which I think is already a good improvement and we got the ability to set the level for more codecs than zstd too :)
Signed-off-by: Martin Tzvetanov Grigorov <[email protected]>
Signed-off-by: Martin Tzvetanov Grigorov <[email protected]>
Signed-off-by: Martin Tzvetanov Grigorov <[email protected]>
I just noticed that Update: my fault! I needed to enable the features ...
All looks good! |
…adata Signed-off-by: Martin Tzvetanov Grigorov <[email protected]>
I think I am ready! |
use bzip2::Compression; | ||
|
||
#[derive(Clone, Copy, Eq, PartialEq, Debug)] | ||
pub struct Bzip2Settings { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder: do we need **Settings for each codec ?
All 3 classes are the same now.
If we need to add custom codec specific settings later it should not be a big deal - the other codecs will just ignore any setting they don't support. I do not expect many new settings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't work because the codecs use different compression level by default and if more than one feature is enabled then it is not possible to decide what to use as a default.
Signed-off-by: Martin Tzvetanov Grigorov <[email protected]>
Signed-off-by: Martin Tzvetanov Grigorov <[email protected]>
This way there is no need to enable other codecs when running unrelated tests. Otherwise the build fails with: ``` error: function `avro_4032_codec_settings` is never used --> avro/tests/codecs.rs:50:4 | 50 | fn avro_4032_codec_settings(codec: Codec) -> TestResult { | ^^^^^^^^^^^^^^^^^^^^^^^^ ``` Signed-off-by: Martin Tzvetanov Grigorov <[email protected]>
It looks good to me, thank you @martin-g ! It worked well when I used it in my application. |
Thank you too, @MarcoLugo ! |
…r all codec that support it (#3095) * AVRO-4032: add zstd variant that allows for level config * add docs * manual impl for IntoStaticStr to handle 2 variants with same value * AVRO-4032: [Rust] Add codec settings for Bzip2 Signed-off-by: Martin Tzvetanov Grigorov <[email protected]> * AVRO-4032: [Rust] Add settings for Xz codec Signed-off-by: Martin Tzvetanov Grigorov <[email protected]> * add zstd settings * make our new settings structs visible * AVRO-4032: [Rust] Add support for Zstandard dictionary Signed-off-by: Martin Tzvetanov Grigorov <[email protected]> * AVRO-4032: [Rust] Store the codec compression_level in the header metadata Signed-off-by: Martin Tzvetanov Grigorov <[email protected]> * AVRO-4032: [Rust] Minor cleanups Signed-off-by: Martin Tzvetanov Grigorov <[email protected]> * AVRO-4032: Fix the build Signed-off-by: Martin Tzvetanov Grigorov <[email protected]> * AVRO-4032: Add tests for Codec::Null and ::Deflate This way there is no need to enable other codecs when running unrelated tests. Otherwise the build fails with: ``` error: function `avro_4032_codec_settings` is never used --> avro/tests/codecs.rs:50:4 | 50 | fn avro_4032_codec_settings(codec: Codec) -> TestResult { | ^^^^^^^^^^^^^^^^^^^^^^^^ ``` Signed-off-by: Martin Tzvetanov Grigorov <[email protected]> --------- Signed-off-by: Martin Tzvetanov Grigorov <[email protected]> Co-authored-by: Martin Tzvetanov Grigorov <[email protected]> (cherry picked from commit 9379fdb)
AVRO-4032
What is the purpose of the change
This PR adds support for picking a custom compression level for the
Zstandard
codec. This change does not remove theZstandard
codec variant in order to keep backwards compatibility and a slightly simpler interface for the default case.Verifying this change
Added new tests:
zstandard
(i.e. same asZstandard
codec) as this is still the same codec.Documentation
Yes.
Simple description available in
cargo
-generated documentation. Validated it usingcargo doc --open --no-deps --all-features
.