Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changing the binary message format? #2496

Open
twittner opened this issue Feb 11, 2022 · 8 comments
Open

Changing the binary message format? #2496

twittner opened this issue Feb 11, 2022 · 8 comments

Comments

@twittner
Copy link
Contributor

Currently Ockam messages are encoded with BARE, a binary data format that optimises for compactness and simplicity. It does that by requiring external schema information which allows it to not encode a lot of structural information. Apart from actual data such as string values or numbers it only adds some length information where necessary, but not much else. Insofar the format has only minimal overhead. The flip side of this approach is that without the correct schema, BARE messages can not be decoded. This prevents generic message processing and makes it difficult to have support for forward and backward compatibility1 as it is not possible to skip over message parts without knowing their meaning. The ambiguities that may arise surface in functions like ockam_node::parser::message which retries message decoding should the first decoding attempt fail, by modifying the message. The appendix below contains further examples that highlight the importance of accurate schema information.

Many formats work with schema information and it is a great way to save precious bytes, however it would be nice if in addition, messages have enough embedded structure so that forward and backward compatibility is possible and maybe also generic processing which can help with debugging or logging.

Are there alternatives to BARE which offer this without sacrificing any of the other important requirements, i.e.

  1. a compact, simple and language agnostic format, with
  2. simple codecs that have minimal resource requirements, which are capable of
  3. safe message processing that handles malformed input gracefully without causing resource exhaustion or violating memory safety?

I would like to argue for CBOR here which is not as compact as BARE, but should otherwise offer all of this.

In contrast to BARE, CBOR does encode more structural information which increases message size. Every CBOR data item is prefixed with a byte that encodes the item type as well as additional information (mostly about how to load the data, e.g. length information). This enables generic processing of arbitrary data items, for example to map them to a human readable format. This also makes it possible to skip over message parts which in turn allows for forward and backward compatibility. The interesting question is how much overhead does this cost in practice? In https://github.com/twittner/encoding-size I attempt to do a comparison between BARE and CBOR. As data I use randomly generated trees of structs and enums with additional data items such as strings and numbers. Here is an example run:

>>> TINY
BARE = 53 bytes ( 82.81%)
CBOR = 64 bytes (120.75%)
>>> SMALL
BARE = 14754 bytes ( 92.56%)
CBOR = 15940 bytes (108.04%)
>>> MEDIUM
BARE = 800402 bytes ( 97.13%)
CBOR = 824074 bytes (102.96%)
>>> LARGE
BARE = 15972854 bytes ( 99.35%)
CBOR = 16077387 bytes (100.65%)

The labels "TINY", "MEDIUM", etc. classify the ratio of plain data sizes (e.g. the length of strings) to the overall structure, for example TINY means that the message contains very short strings and vectors. As expected the overhead of CBOR is higher the more structural information is encoded relative to plain data items. However I think the differences are fairly small, especially when considering absolute numbers.

More interesting than this is how this would work out for the ockam repository. In here I moved most of the message types from BARE to CBOR and printed the message lengths from ockam_core::message::Encodable::encode for the develop branch as well as for my CBOR branch while running cargo test --all -j 1 -- --test-threads=1 --nocapture. On average, BARE encoded messages are 256 bytes in size, whereas the CBOR encoded ones measure 285 bytes on average, confirming the above results (89%, 111%). (The CSV with all printed sizes can be found here).

If the increase in size is acceptable and there is agreement that it would be desirable to have generic message processing and the possibility of forward and backward compatibility, the branch could be polished and become a pull request. However I am just exploring a possibility here and would like to discuss this first before investing more time.

For further information about CBOR see https://cbor.io. Forward and backward compatibility with application to CBOR is described in more detail here.

Appendix

Example 1

Flattened structs
use serde::{Serialize, Deserialize};

fn main() {
    // Structs in BARE are always flattened. Members are just concatenated and
    // the nesting is not reflected in the format. This makes struct values
    // potentially very very compact.

    #[derive(Debug, PartialEq, Eq, Serialize, Deserialize)]
    struct S(String);

    #[derive(Debug, PartialEq, Eq, Serialize, Deserialize)]
    struct T(S, bool);

    let t1  = T(S("hello".to_string()), true);
    let buf = serde_bare::to_vec(&t1).unwrap();
    let t2: T = serde_bare::from_slice(&buf).unwrap();
    assert_eq!(t1, t2);

    // The resulting byte sequence looks like this:
    // ┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
    // │00000000│ 05 68 65 6c 6c 6f 01    ┊                         │•hello• ┊        │
    // └────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘
    // Only the string value (prepended by its length) and the bool value are encoded.
    //
    // NB that the absence of schema information makes it possible to decode any struct
    // prefix on its own, e.g. we could parse this just as an `S` instead of a `T`:

    let s1: S = serde_bare::from_slice(&buf).unwrap();
    assert_eq!(t1.0, s1);

    // This breaks down of course when the suffix is not ignored, e.g. due to repetition:

    let buf = serde_bare::to_vec(&(&t1, &t1)).unwrap();

    // ┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
    // │00000000│ 05 68 65 6c 6c 6f 01 05 ┊ 68 65 6c 6c 6f 01       │•hello••┊hello•  │
    // └────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘

    let s2: (S, S) = serde_bare::from_slice(&buf).unwrap();
    assert_eq!(s2, (S("hello".to_string()), S(String::from(char::from(5)))));

    // Here, after the first string, the next byte `01` is read as the length of the
    // next string and therefore the `05` is read as a UTF-8 string value.
    //
    // This shows how important the schema information is. Without an accurate schema
    // the decoding can easily go astray.
}

Example 2

The void type
use serde::{Serialize, Deserialize};

fn main() {
    // Empty structs and the unit type are not encoded, i.e. in RFC terms their
    // schema is the void type. According to section 2.4, void types "MUST NOT
    // be used as an optional type, struct member, list member, map key, or map
    // value. Void types may only be used as members of the set of types in a
    // tagged union."
    //
    // When asked about this restriction the author answered with "To avoid
    // getting to futzy with the type system. I want BARE message specifications
    // to be reasonably close to what you're going to see on the wire. It's
    // designed to avoid getting too clever with types. And yes, some languages
    // don't have unit types, and this helps with broader compatibility." [1]
    //
    // Despite this, serde_bare accepts empty structs and treats them as void,
    // i.e. it does not encode them:
    #[derive(Debug, PartialEq, Eq, Serialize, Deserialize)]
    struct A;

    #[derive(Debug, PartialEq, Eq, Serialize, Deserialize)]
    struct B(A);

    #[derive(Debug, PartialEq, Eq, Serialize, Deserialize)]
    struct C(B);

    let c1  = C(B(A));
    let buf = serde_bare::to_vec(&c1).unwrap();
    assert!(buf.is_empty());
    let c2: C = serde_bare::from_slice(&buf).unwrap();
    assert_eq!(c1, c2);
}

// [1]: https://lists.sr.ht/~sircmpwn/public-inbox/%3CC106F0AA-BB7D-4B8C-A105-82C0F6BB4687%40taglialegne.it%3E#%3CC4HL2XIJK0XA.1OU5CCJ34OVTF@homura%3E

Footnotes

  1. Section 4 of the RFC draft of BARE mentions schema evolution in a backward and forward compatible way by using tagged unions with explicit tag numbers. While adding another tagged branch is backward compatible because newer software understands the messages of older software, the reverse is not true: a message using this new tag will be rejected by older software, hence forward compatibility is not realised.

@mrinalwadhwa
Copy link
Member

@twittner thank you for starting this discussion and writing such an insightful analysis ❤️

I'll spend more time with your notes soon, but here's one idea that we should consider ...

Certain messages on the routing/messaging layer (e.g TransportMessage or SecureChannel's data) are wrapped inside each other multiple times .. to support tunneling.

For such messages ... overheads compound quickly, and compactness is more important .. especially to maintain support for low-power wireless protocols. It would be okay to compromise on ease of use at this low-level- if we have to. Safety is critical at this layer but the surface we will have to check is relatively small.

As we move up layers to protocols that are end-to-end, compactness is less important because the cost is not compounding. Ease of use is significantly more important. Safety critical surface is relatively large.

I think we should consider CBOR at these higher level layers - but keep BARE (or something similarly compact) at the routing/messaging layer.

@thomcc
Copy link
Contributor

thomcc commented Feb 11, 2022

Fascinating -- love that you actually approached this by measuring, and the change looks very nice.

I also wasn't aware of https://docs.rs/minicbor-derive/latest/minicbor_derive/index.html and impressed by the protobuf-style forward/backward compatibility -- very cool. (Honestly, I wasn't aware that you could replace the field names with indices at all -- that seems very useful).

Some questions I have are:

  1. Does the relatively low size cost you demonstrate hold up with types different than what we currently use in Ockam. For example, Vec<i32> or Vec<String>?

  2. There are cases where we wrap and re-wrap messages multiple times for serializaion. For bare, this, in it's plainest form of repeatedly applying compression to a Vec<u8> (e.g. just to measure the part of the size cost that isn't "our fault"), ends up being about 8 bytes of overhead each time. How does CBOR scale under repeated re-encoding.

  3. I'm familiar with Packed CBOR, which I believe was motivated by a desire to reduce CBOR message sizes. While I have some... doubts about Packed CBOR (I think they should just use real compression), I had figured the motivation was genuine. Is the difference here just the use of field indices and not names? Is there something I'm missing?

Also, I must admit, I'm a little unsure what point you're getting at with the two examples. Can you explain them a bit more?

Some other points:

This prevents generic message processing and makes it difficult to have support for forward and backward compatibility as it is not possible to skip over message parts without knowing their meaning

Yeah, this is a big issue in practice -- as it's very easy to #[cfg(...)] away a field of a struct, and it's tedious to write serde deserializers that handle this. In #2492 I have this issue, and note it here.

@antoinevg
Copy link
Contributor

This is a great analysis @twittner, I hadn't come across CBOR before.

I'd also be interested to see numbers for encoding efficiency with nested messages as @mrinalwadhwa mentioned!

My biggest outstanding question is around implementing CBOR on embedded targets.

  • Are there any codec implementation available that can build on no_std targets?
  • If there are, can any of them also be used in a heapless, statically allocated environment? (i.e. does not perform runtime memory allocation or require the alloc library)

@SanjoDeundiak
Copy link
Member

SanjoDeundiak commented Feb 11, 2022

Has anyone looked at RION? Their design goals look tempting
Also I could argue for protobuf. Having language-independent schema that can generate code for you is very useful. We already came across some problems serializing messages between Rust <-> Elixir. Imagine if we had 10 implementations - adding new message struct would require a lot of attention and testing.

@twittner
Copy link
Contributor Author

Thank you all for your comments! I am going to reply in one go:

@mrinalwadhwa: Certain messages on the routing/messaging layer (e.g TransportMessage or SecureChannel's data) are wrapped inside each other multiple times .. to support tunneling. For such messages ... overheads compound quickly, and compactness is more important [...]

@thomcc: How does CBOR scale under repeated re-encoding.

@antoinevg: I'd also be interested to see numbers for encoding efficiency with nested messages as @mrinalwadhwa mentioned!

CBOR is less good than BARE as the structural overhead is repeated, but it depends on how much the structure dominates the message relative to plain data. To give a contrived example, let us consider this type:

#[derive(Encode, Decode, Serialize, Deserialize)]
struct S {
    #[cbor(n(0), with = "minicbor::bytes")] a: Vec<u8>
}

and use S to recursively encode itself:

fn wrap(rem: usize, to_vec: &impl Fn(S) -> Vec<u8>) -> S {
    if rem == 0 {
        S { a: Vec::new() }
    } else {
        S { a: to_vec(wrap(rem - 1, to_vec)) }
    }
}

BARE only encodes the vector length and its bytes whereas CBOR adds information about the structure itself. The byte lengths for the number of recursion steps are:

   0: BARE = 1 bytes (50.00%), CBOR = 2 bytes (200.00%)
   1: BARE = 2 bytes (50.00%), CBOR = 4 bytes (200.00%)
...
  10: BARE = 10 bytes (50.00%), CBOR = 20 bytes (200.00%)
...
 100: BARE = 100 bytes (33.56%), CBOR = 298 bytes (298.00%)
...
1000: BARE = 1872 bytes (48.02%), CBOR = 3898 bytes (208.23%)

The only thing we encode here is essentially (serialised) structural information and clearly BARE is more compact than CBOR. However I do maintain that types more commonly used, which also contain plain data such as strings are less pathological. If we change S for instance to also contain the string "hello world" as a second field the picture changes to the more familiar numbers I had reported previously:

   0: BARE = 13 bytes (92.86%), CBOR = 14 bytes (107.69%)
   1: BARE = 26 bytes (92.86%), CBOR = 28 bytes (107.69%)
...
  10: BARE = 130 bytes (87.84%), CBOR = 148 bytes (113.85%)
...
 100: BARE = 1390 bytes (87.97%), CBOR = 1580 bytes (113.67%)
...
1000: BARE = 13990 bytes (87.55%), CBOR = 15980 bytes (114.22%)

@thomcc: Does the relatively low size cost you demonstrate hold up with types different than what we currently use in Ockam. For example, Vec<i32> or Vec<String>?

In https://github.com/twittner/encoding-size/blob/master/src/main.rs#L13-L39 I tried to use a fairly general and recursive type that also uses a Vec<u16> and a Vec<String>, so the numbers I cited above should apply. Of course a Vec<T> with T not being a u8 takes up more space because in CBOR this will be an array of Ts and each element is prefixed with type information, whereas a Vec<u8> can be encoded more efficiently as a CBOR bytes type. It is also noteworthy that integers are always of variable length in CBOR, so even a u64 may end up occupying just one byte for very small values.

@thomcc: Packed CBOR [...] Is the difference here just the use of field indices and not names? Is there something I'm missing?

I think you understood correctly. I have not used packed CBOR here, but simply indices which in this case are mapped to array positions, but could also be used as keys in CBOR maps if a type is expected to have many gaps.

@thomcc: Also, I must admit, I'm a little unsure what point you're getting at with the two examples. Can you explain them a bit more?

Example 1 was meant to show how easy it is to be unaware of the boundaries between types. I stumbled across this when replacing BARE with CBOR because I got an error at https://github.com/ockam-network/ockam/blob/47a29d20081b52aca162aba9c3d19c88a69f6165/implementations/rust/ockam/ockam_transport_tcp/src/workers/sender.rs#L221 where the payload is decoded as a TransportMessage. IIUC the message is actually a LocalMessage and it surprised me at first that the decoding succeeded. Example 1 is really just an artefact of this investigation and for people familiar with BARE probably not very interesting.

Example 2 is about the void type and my confusion because I think serde_bare treats empty structs as void, but in places where the RFC states they must not be used (e.g. as struct fields, optional values etc). Again the example is more of a note and maybe I have misread the RFC.

@antoinevg: Are there any codec implementation available that can build on no_std targets? If there are, can any of them also be used in a heapless, statically allocated environment? (i.e. does not perform runtime memory allocation or require the alloc library)

Yes. The crate minicbor is meant to be used in a no_std environment and non-allocating.

@SanjoDeundiak: Has anyone looked at RION?

So far I have not but it seems to be similar to CBOR and MessagePack. I could not find a Rust implementation though. Do you know if there is one?

@SanjoDeundiak: Also I could argue for protobuf.

Yes, protocol buffers are nice in a way. I am not too fond of the code that is generated from the external schema and the schema language is quite restrictive compared to what is possible within programming languages, e.g. when it comes to generic types.

@SanjoDeundiak: We already came across some problems serializing messages between Rust <-> Elixir. Imagine if we had 10 implementations - adding new message struct would require a lot of attention and testing.

👍 I did not focus on cross-language communication so far but it is absolutely important and with protocol buffers the schema usually comes first which takes care of that. With serialisation attributes on the other hand, be it serde or minicbor, the schema is embedded in the types and communicating this information correctly to another programming language is comparatively difficult. At least, even with formats that do not require a schema one can often still make use of one. For CBOR a schema description language CDDL has been standardised and there are tools to check that a CBOR item matches a schema. It still requires extra work compared to using protobuf which generates code that is guaranteed to conform to a schema.

@spacekookie
Copy link
Contributor

Thank you @twittner for starting this discussion. We definitely have a bunch of foot-guns in the Rust code where we hack around weird BARE encoding mechanisms. We're also currently figuring out how to do encoding of metadata for higher level protocols so the timing for this discussion is impeccable 🙂

I haven't worked with CBOR before but it reminds me of msgpack a lot. Having a self-describing encoding format imo only has upsides. The differences in size for simple datastructures are trivial in most cases. We will want to investigate the overhead this causes on embedded systems. But I also think that this is memory worth spending. Maybe instead we need to focus on optimising the runtime/ routing mechanisms themselves to reduce our memory footprint.

CBOR is less good than BARE as the structural overhead is repeated, but it depends on how much the structure dominates the message relative to plain data. To give a contrived example, let us consider this type:

While this is a pretty big limitation, I don't think we need to worry about it too much since a lot of our message nesting actually re-encodes payloads. We do this to create decoupling between workers to make the Rust type system happy, but it comes with the added benefit that, while we have deeply nested structures, encoding and decoding don't have to be fully aware of it.

Lastly: my experience with protobuf is that in most cases using it is overkill and causes more problems than it solves. We may want to investigate it nonetheless, but personally I think going with something like CBOR, which is well standardised and has implementations in many languages without relying on code-generators (meaning that writing a new implementation is also less daunting than writing a protobuf implementation), is the better way to go.

@antoinevg
Copy link
Contributor

Having a self-describing encoding format imo only has upsides. The differences in size for simple datastructures are trivial in most cases. We will want to investigate the overhead this causes on embedded systems. But I also think that this is memory worth spending. Maybe instead we need to focus on optimising the runtime/ routing mechanisms themselves to reduce our memory footprint.

Absolutely, there are far larger embedded memory optimizations we can perform before we even begin to approach message format contributions! 👍

@hairyhum
Copy link
Contributor

I think we will need to revive that discussion as we need to start implementing OpenTelemetry #2319 integration between Rust and Elixir as we need to accomodate telemetry metadata in the message format.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants