Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move away from using Vec<u8> to represent raw content key/value #1404

Open
18 tasks
morph-dev opened this issue Aug 26, 2024 · 23 comments
Open
18 tasks

Move away from using Vec<u8> to represent raw content key/value #1404

morph-dev opened this issue Aug 26, 2024 · 23 comments
Labels
flamingo Maintenance or downtime for the person on Flamingo rotation to tackle good-community-issue good-first-issue Good for newcomers refactoring Code refactoring

Comments

@morph-dev
Copy link
Collaborator

morph-dev commented Aug 26, 2024

Instead of using Vec<u8> to represent raw content keys/values, we should use something like alloy_primitives::Bytes because:

  • copy is faster
  • (de)serializes into hex string correctly

We should either wrap or name the type, to avoid potential confusion (e.g. see RawContentValue).

Other places that should be updated:

@morph-dev morph-dev added good-first-issue Good for newcomers refactoring Code refactoring flamingo Maintenance or downtime for the person on Flamingo rotation to tackle good-community-issue labels Aug 26, 2024
@etherhood
Copy link
Contributor

Can I take this one?

@morph-dev
Copy link
Collaborator Author

Can I take this one?

@etherhood, I would wait for #1403 to be merged first, as it does background for this work.
You can probably start working on top of it, but you might have to merge and resolve conflicts if there are some comments/concerns there.

@etherhood
Copy link
Contributor

I see its merged now, If it is fine with you, I will start on it

@etherhood
Copy link
Contributor

I have added some commits in WIP PR #1431
I changed inner type of RawContentKey and fixed its uses across codebase.

Should I also change Vec type to Bytes in all places wherever Vec is being used, as I couldn't find RawContentKey being used in portalnet/src/find/query_info.rs, which you mentioned in original issue, but there are usage of Vec

@morph-dev
Copy link
Collaborator Author

Should I also change Vec type to Bytes in all places wherever Vec is being used, as I couldn't find RawContentKey being used in portalnet/src/find/query_info.rs, which you mentioned in original issue, but there are usage of Vec

I would say replacing one type at the time is good (unless they have overlapping uses).

Regarding portalnet/src/find/query_info.rs, The Vec<u8> in RecursiveFindContentResult should be replaces with RawContentValue, but that should be a separate PR.

@etherhood
Copy link
Contributor

Okay thanks, that sounds about right.
I think than current PR is ready for review. Can you please review it. Thanks

@ogenev
Copy link
Member

ogenev commented Sep 20, 2024

Fixed by #1431 .

@ogenev ogenev closed this as completed Sep 20, 2024
@morph-dev
Copy link
Collaborator Author

I wouldn't say that this is fully completed. Not all places from above were converted.

@morph-dev morph-dev reopened this Sep 20, 2024
@etherhood
Copy link
Contributor

I can finish this, just to be sure, is there places where we would rather use only Bytes instead of using RawContentKey?

@morph-dev
Copy link
Collaborator Author

Here is an attempt to make a list (I will also update first post with it):

This list is probably not complete, but I believe this issue can be closed once it's done. Of course, all places that use these should be updated as well (e.g. we shouldn't just copy Vec<u8> to Bytes and back in all places).

Ideally, this should be done in several small PRs, each fixing only one or two. Due to their interconnection (especially in portalnet crate), this might not be trivial. In this case, in order to prevent PR from exploding in size, it's fine to have conversion Vec<u8> from/to Bytes in between PRs.

@carver
Copy link
Collaborator

carver commented Oct 3, 2024

Just a reminder that when changes like this go into ethportal-api, it affects external projects. Currently, the main one we consider is glados, but it's good to be thinking about what impact we're having on other projects that want to use portal through trin APIs. I'm ambivalent about forcing the API users to add alloy as a dependency for the Bytes. I'm not happy with having to_bytes() mean alloy's version of it.

IMO it's ugly to have clients have to do something like &content_key.to_bytes()[..] or content_key.to_bytes().to_vec() to get raw bytes, which seems strangely redundant after already calling to_bytes(). One option I like better is to keep the external API of to_bytes() -> Vec<u8> and a new optional one that's like to_alloy_bytes() if you want the listed benefits.

This is all coming up while working on ethereum/glados#321

@etherhood
Copy link
Contributor

@morph-dev Aren't these supposed to be alloy_primitives::Bytes ?

@KolbyML
Copy link
Member

KolbyML commented Oct 3, 2024

Just a reminder that when changes like this go into ethportal-api, it affects external projects. Currently, the main one we consider is glados,

cough cough Portal Hive purely uses ethportal-api for generating JSON-RPC requests 😎

Glados still uses quite a bit of custom method calls, instead of solely using ethportal-api for making Portal JSON-RPC calls.

@carver
Copy link
Collaborator

carver commented Oct 3, 2024

I suppose we don't really need to_bytes() -> Vec<u8> at all, if we just implement Into<Vec<u8>>. But I still feel that to_bytes() should not return alloy's version of Bytes. I think the method should have a different name.

cough cough Portal Hive purely uses ethportal-api for generating JSON-RPC requests 😎

Haha, yes, excellent callout. How are you feeling about the current and proposed API changes, as a "user" of it in Hive?

@morph-dev
Copy link
Collaborator Author

@morph-dev Aren't these supposed to be alloy_primitives::Bytes ?

These are low level network types, and it's fine for them to use bytes::Bytes. They can easily we wrapped in RawContentKey / RawContentValue if/when needed.

But since there is some concerns from other team members, please wait a bit before you continue with your work (until we agree what we want).

@morph-dev
Copy link
Collaborator Author

This reply is mostly to @carver's concerns. Happy to discuss this further, either here, or over a call.

I understand that there is concern regarding switching from Vec<u8> to something else, but I'm not sure if there is concern that that something else is alloy_primitives::Bytes.

Not using Vec<u8>

I want to make a strong case that we shouldn't use Vec<u8> for representing raw content key/value, anywhere.
This type is never modified, and is frequently copied and passed around. As such, bytes::Bytes is much better alternative, primary because copy is cheap (you can read more on official documentation 1 and 2).

I would use the same reason for not wanting to implement Into<Vec<u8>>, as this type shouldn't be used as Vec<u8> when we have better alternative (Bytes). Having multiple ways to represent it as bytes would just cause confusion and wrong usage.

Instead, we should update our codebase (including Glados) and not use Vec<u8> (what this issue is all about).

The only situation where we should use it as Vec<u8> is when interacting with some other libraries (e.g. database library might not support Bytes natively). But we can easily convert Bytes to/from Vec<u8>, so I don't think this is an issue.

I also don't agree that it's usage is ugly.

  • If you just need a reference to bytes, argument type should be &[u8], you can pass it with & and it will work
    • if you have content key -> &content_key.to_bytes()
    • if you have raw_content_key: Bytes -> &raw_content_key
  • If you need to hold it for longer, you should pass it as Bytes (copy is cheap)
  • I would say it's fine to have key.to_bytes().to_vec() in some places until codebase migrates to Bytes (that's what we do currently in trin, and fixing this issue will remove that)
  • In case I'm missing something, you can point me to the code where it's not nice and I can take a look.

alloy_primitives::Bytes vs bytes::Bytes

The alloy version is just a wrapper around bytes::Bytes that supports “0x” prefixed hex strings for serde (which is how we use content keys/values). Also, we depend on alloy_primitives all over ethportal-api and it will be needed for any other crate that depends on ehtportal-api (if for no other reason, then because of Address, U256 and B256 types). We can also export this type if you feel like it's needed.

I would also be fine if we would use bytes::Bytes instead (e.g. for to_bytes() method), and wrapped it in alloy_primitives::Bytes only during serialization/deserialization. But I don't see any good argument for that other than: "it shouldn't be alloy version". Considering that bytes::Bytes is not part of std and is also external dependency (definitely more popular than alloy), I don't consider this alone as good enough reason. But I can be convinced otherwise.

Non-backward compatible changes to ethportal-api

I agree that this type of changes can be a problem for other projects that depend on ethportal-api (e.g. Glados, Hive), and I wasn't thinking about that (my mistake). Ideally, we should:

  1. introduce new api and keep old one as deprecated
  2. allow some time for projects to migrate
  3. remove old api

I know you are currently working on getting Glados to the point where it can depend on newest ethportal-api (or maybe you already finished that?!), and I'm willing to help with migration.

@carver
Copy link
Collaborator

carver commented Oct 13, 2024

Ok, just organizing my thoughts here, now that I'm done with the API upgrade work. If we end up needing more follow-up, let's do a call.

I think we're agreeing on most of this. I definitely get the benefits of Bytes, and having it as an option is something I'm 💯 on board with. I'm even quite happy with it being the default option, and having it be the thing returned by to_bytes().

I guess where we are currently split is in intentionally hiding away the to_vec() to kind of try to save users from themselves, and whether that makes using the library uglier.

The one option missing from this list is when a user is forced to convert to a Vec because of using a 3rd party library like a DB (as you mentioned elsewhere):

I also don't agree that it's usage is ugly.

  • If you just need a reference to bytes, argument type should be &[u8], you can pass it with & and it will work

    • if you have content key -> &content_key.to_bytes()
    • if you have raw_content_key: Bytes -> &raw_content_key
  • If you need to hold it for longer, you should pass it as Bytes (copy is cheap)

  • I would say it's fine to have key.to_bytes().to_vec() in some places until codebase migrates to Bytes (that's what we do currently in trin, and fixing this issue will remove that)

  • In case I'm missing something, you can point me to the code where it's not nice and I can take a look.

So in this case:
https://github.com/ethereum/glados/blob/e648b186572ae7355a7672f9996514e33345ddbd/glados-audit/src/selection.rs#L470

We can't migrate to Bytes because we have to pass the value along to sea-orm.

I think our ethportal API is better if content_key.to_vec() is an option. We can have the docstr remind folks to use to_bytes() whenever possible. But hiding it away behind content_key.to_bytes().to_vec() feels a little like a slap on the wrist, despite doing something quite normal, when interacting with a 3rd-party lib. 😆

I understand that there is concern regarding switching from Vec<u8> to something else, but I'm not sure if there is concern that that something else is alloy_primitives::Bytes.

I do think there's a little bit of "ugh, I have to learn the intricacies of another library" for alloy_primitives::Bytes. I probably would have had a smaller reaction for bytes::Bytes which has 1000x more usage. But I guess I mostly feel agnostic to which one we return. Happy to defer on that.

Not using Vec<u8>

I want to make a strong case that we shouldn't use Vec<u8> for representing raw content key/value, anywhere.

For any users of the API library, ever? We already have the first counter-example. I'm sure there are plenty more reasonable ones.

I'm happy to say we shouldn't ever use it internally to trin. And we should encourage that pattern in our API docs.

@morph-dev
Copy link
Collaborator Author

@carver and I had a chat about this and here is the summary (@carver , let me know if I forgot something):

  • we should add more documentation in code on why Bytes should be used over Vec<u8> and also explain how one type should be converted to another
    • documentation is intended for users who are not familiar with details of the Bytes, and expect to have Vec<u8>, which is the most standard way to represent byte array
  • we agreed that it would be better if public type would be bytes::Bytes instead of alloy::primitives::Bytes (because it's more likely that users would be more familiar with bytes::Bytes version)
    • it was left to me to look into how easy/convenient would be to change this, with the idea that we can convert it to/from alloy::primitives::Bytes only when we need to (e.g. (de)serialization)

After looking into more details, I think that public type should still be alloy::primitives::Bytes (not bytes::Bytes). Here are my reasons:

  • It implements more frequently used traits, the way we would expect it to
    • Display, FromHex, FromStr
    • ssz::Encode and ssz::Decode
    • Arbitrary
  • Some traits that bytes::Bytes implement, don't behave as expected:
    • Debug: serializes it as ascii string (instead of hex decimal)
    • Serialize: it serializes it as list of numbers: "[0,1,2,3,10,100,200,250,255]" (we would want hex string)
    • Deserialize: It would happily deserialize any string, but it would interpret it as ascii byte array
      • for example: "0x00" would deserialize into "[48,120,48,48]"

With this being said, @carver, let me know if you still think that we should use bytes::Bytes.

I will create a PR to update documentation.

@etherhood , you can proceed with refactoring (if you still wish to do so). I think everything here is still accurate: #1404 (comment)

@carver
Copy link
Collaborator

carver commented Oct 20, 2024

Yup, that looks complete. 👍🏻 for sticking with the alloy version.

  • we should add more documentation in code on why Bytes should be used over Vec<u8> and also explain how one type should be converted to another

To be explicit about this ask: it would be great to see a docstring on each to_bytes() method with something like:

One benefit of returning Bytes is that it does not clone the underlying data. If you must have a vector, then clone this data into a vector with .to_bytes().to_vec().

Hm, though I see that this claim is aspirational at the moment, since we do actually clone all the data on every to_bytes() right now:

fn to_bytes(&self) -> RawContentKey {
let mut bytes: Vec<u8> = Vec::new();
match self {
HistoryContentKey::BlockHeaderByHash(k) => {
bytes.push(HISTORY_BLOCK_HEADER_BY_HASH_KEY_PREFIX);
bytes.extend_from_slice(&k.block_hash);
}
HistoryContentKey::BlockBody(k) => {
bytes.push(HISTORY_BLOCK_BODY_KEY_PREFIX);
bytes.extend_from_slice(&k.block_hash);
}
HistoryContentKey::BlockReceipts(k) => {
bytes.push(HISTORY_BLOCK_RECEIPTS_KEY_PREFIX);
bytes.extend_from_slice(&k.block_hash);
}
HistoryContentKey::BlockHeaderByNumber(k) => {
bytes.push(HISTORY_BLOCK_HEADER_BY_NUMBER_KEY_PREFIX);
bytes.extend_from_slice(&k.block_number.as_ssz_bytes());
}
}
bytes.into()
}

Which means that every to_bytes().to_vec() doubly-clones the key. I don't see a quick fix to avoid that. Which kind of brings me back to: maybe we should just have a to_vec() on the underlying key. Since that would actually save us a clone compared to to_bytes().to_vec().

@morph-dev
Copy link
Collaborator Author

morph-dev commented Oct 20, 2024

To be explicit about this ask: it would be great to see a docstring on each to_bytes() method with something like:

One benefit of returning Bytes is that it does not clone the underlying data. If you must have a vector, then clone this data into a vector with .to_bytes().to_vec().

I just created #1543. Let me know there if something else should be clarified.

Which means that every to_bytes().to_vec() doubly-clones the key. I don't see a quick fix to avoid that. Which kind of brings me back to: maybe we should just have a to_vec() on the underlying key. Since that would actually save us a clone compared to to_bytes().to_vec().

Yes, I just noticed this as well. And few other API changes that I wanted to improve. So I'm working on small refactoring.

For the explicit code that you linked, I believe that converting from Vec<u8> into Bytes doesn't actually copy data:
https://github.com/tokio-rs/bytes/blob/0ac54ca706dfc039cc738962581bba4793860605/src/bytes.rs#L870-L903

But instead of using Vec<u8>, I believe that we should use BytesMut instead anyway.

What do you think?

Sidenote: I'm not sure if Bytes into Vec<u8> always copies data. If there is another "view" of the bytes, then it definitely does. But if converted bytes are the only "view", then it might not (but I'm not 100%) sure. So something like this, might not copy any data at all: content_key.to_bytes().to_vec(), even with current implementation.

@morph-dev
Copy link
Collaborator Author

Refactoring that I mentioned are in #1544

@etherhood
Copy link
Contributor

I will resume working on it.

@etherhood
Copy link
Contributor

@morph-dev Created PR #1552, updating Vec to RawContentKey/RawContentValue or bytes::Bytes wherever applicable, please review.
There are some places where didn't remove it, as it was being using in some intermediary functions. We can target rest of places in next PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flamingo Maintenance or downtime for the person on Flamingo rotation to tackle good-community-issue good-first-issue Good for newcomers refactoring Code refactoring
Projects
None yet
Development

No branches or pull requests

6 participants
@carver @KolbyML @ogenev @morph-dev @etherhood and others