Bootstrap state network with state data post block 21M #1599
Comments
The remaining question is whether to go with 16 or 32 nodes, and how big their storage should be.
It is important to note that we can start with 16 nodes and very easily add 16 more in the future if we need them.
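As a rough aid for the 16 vs 32 question, the sketch below estimates each node's share of the snapshot under an even split. It is only an illustration: it takes the ~400 GB stored-data figure quoted in the issue description, and it assumes that content ids are spread uniformly over the 256-bit keyspace, that the node count is a power of two, and that each node is given a distinct id prefix. The helper names `even_layout` and `owner_prefix` are hypothetical and not part of any client.

```python
KEYSPACE_BITS = 256  # Portal content ids are 256-bit values


def even_layout(node_count: int, stored_gb: float):
    """Return (prefix, radius, storage share in GB) per node for an even split.

    Assumes node_count is a power of two and content ids are uniformly spread.
    """
    if node_count < 1 or node_count & (node_count - 1):
        raise ValueError("node_count must be a power of two")
    prefix_bits = node_count.bit_length() - 1
    # Under the XOR metric, a radius of 2**(256 - k) - 1 covers exactly the
    # ids that share the node's top k bits.
    radius = (1 << (KEYSPACE_BITS - prefix_bits)) - 1
    share_gb = stored_gb / node_count
    return [(prefix, radius, share_gb) for prefix in range(node_count)]


def owner_prefix(content_id: int, node_count: int) -> int:
    """Index of the node whose prefix bucket contains content_id."""
    prefix_bits = node_count.bit_length() - 1
    return content_id >> (KEYSPACE_BITS - prefix_bits)


for n in (16, 32):
    layout = even_layout(n, stored_gb=400)
    print(f"{n} nodes: ~{layout[0][2]:.1f} GB each before replication or headroom")
```

Under these assumptions, going from 16 to 32 nodes halves each node's share, since every 4-bit prefix bucket splits into two 5-bit buckets. Real disks would need to be sized above the raw share to leave room for state diffs and growth.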
Goal/Context for this issue
This PR doesn't include our goal/target, which I think is important context. We want all state from block 21 million to latest to be available on the Portal State network, with a delay of 8-16 blocks, by February 1st or so. We already have the code infrastructure to run the Portal State Network with a block delay of 8-16 blocks. The delay is needed because we don't have the infrastructure to handle re-orgs yet, so we can't gossip at the tip of the chain until Trin-Execution has that implemented.
There are potentially a lot of unknowns and pitfalls that could cause delays, and we won't know until we have done it and have it live, so getting this started early is important. By deploying this infrastructure asap, we should be able to reach our goal of the Portal State Network having state live by February 1st.
Code optimizations can happen after latest is live. They shouldn't block getting Portal State live unless they are required for it, which would be rare. "Optimizations" are an easy pitfall to fall into; they are a problem for when we have a working product (hopefully with users). If what we have works, we shouldn't delay shipping it, as potential unknowns could then cause us to miss our deadlines.
Non-standard deployment strategy
The infrastructure for deploying this i.e. Our bridges will be running non-master branch code. Currently Hopefully once all of our infrastructure is underway for us reaching our deadlines. I will work on
For the Fluffy State Bridge, I was thinking about implementing support for gossiping data into the network through multiple Fluffy instances, load balancing the offers across them to speed up the process. Our state bridge already supports gossiping multiple blocks concurrently, but all through a single Fluffy instance over JSON-RPC. Since we use the standard JSON-RPC portal_stateGossip endpoint, we can easily spin up multiple portal nodes and load balance the content across them. In theory this should dramatically increase gossip speed by scaling the process horizontally. With this method the bottleneck would likely become upload bandwidth, but inside AWS or another cloud provider you can get very fast uploads for a cost. For example, the AWS egress cost for uploading 11 TB is around $1000.
Just sharing this as an alternative scaling idea. I'm not sure how well it will perform in practice, but if/when we implement it I'll share the results.
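To make the load-balancing idea concrete, here is a minimal sketch of spreading portal_stateGossip calls across several portal node endpoints in round-robin order. It is not the Fluffy bridge's implementation: the function names, the endpoint URLs, the worker count, and the parameter order (content key, then content value, both hex strings) are assumptions for illustration.

```python
import itertools
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def http_send(endpoint: str, payload: dict) -> dict:
    """POST one JSON-RPC request and return the decoded response."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def plan_gossip(endpoints, items):
    """Pair each (content_key, content_value) with an endpoint, round-robin."""
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    rotation = itertools.cycle(endpoints)
    plan = []
    for request_id, (key, value) in enumerate(items):
        payload = {
            "jsonrpc": "2.0",
            "id": request_id,
            "method": "portal_stateGossip",
            "params": [key, value],  # assumed parameter order: key, then value
        }
        plan.append((next(rotation), payload))
    return plan


def gossip_all(endpoints, items, send=http_send, workers=8):
    """Send all planned requests concurrently; results keep the input order."""
    plan = plan_gossip(endpoints, items)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda pair: send(*pair), plan))
```

A caller would pass a list of node URLs (for example, several local instances on different ports) and an iterable of hex-encoded key/value pairs. The `send` argument is injectable so the distribution logic can be exercised without a running node. Round-robin is the simplest policy; a real bridge might prefer to route by queue depth or by per-node error rate.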
After an initial investigation, we concluded that it would take 1200 days for a single bridge to gossip the state snapshot for block 21M.
Even with some optimization and parallelization (requiring non-trivial engineering effort), just gossiping the data would take around 1 month.
Plan
To speed up the process, we came up with the following plan:
This approach saves us from gossiping 11.2 TB of data across the network only to end up storing ~400 GB of it (the remaining 10.8 TB consists of proofs that we don't store).
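The figures above can be cross-checked with a little arithmetic. The sketch below derives the average throughput implied by the 1200-day and 1-month estimates and the share of gossiped data that is actually stored. It assumes decimal units (1 TB = 10^12 bytes), a 30-day month, and that both time estimates refer to the same 11.2 TB of gossiped data.

```python
TB = 10**12  # decimal terabyte, an assumption about the units used above
GB = 10**9
SECONDS_PER_DAY = 86_400

gossiped_bytes = 11.2 * TB
stored_bytes = 400 * GB


def implied_rate(total_bytes: float, days: float) -> float:
    """Average bytes per second needed to move total_bytes in the given days."""
    return total_bytes / (days * SECONDS_PER_DAY)


single_bridge = implied_rate(gossiped_bytes, 1200)  # single-bridge estimate
one_month = implied_rate(gossiped_bytes, 30)        # optimized estimate
stored_fraction = stored_bytes / gossiped_bytes

print(f"single bridge: ~{single_bridge / 1e3:.0f} kB/s")
print(f"one month:     ~{one_month / 1e6:.1f} MB/s")
print(f"stored share:  ~{stored_fraction:.1%}")
```

Under these assumptions a single bridge averages on the order of 100 kB/s, and less than 4% of the gossiped bytes would end up stored, which is the overhead the plan avoids.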
Downsides
We considered the following downsides of this approach:
Wouldn't this create a Big vs Small nodes problem (making content not easily discoverable)?
Considering the current size of the network (fewer than 100 nodes), we don't believe this is a big problem, especially if we go with 32 nodes that are evenly distributed.
The plan is to convert these nodes into "Large radius nodes" in the future (see: ethereum/portal-network-specs/#283), which would completely eliminate this problem.
This content wouldn't be available on other portal network nodes
A couple of reasons why we don't see this as a big issue at the moment:
The state diffs since block 21M would still be gossiped via regular bridges, distributing them to all other nodes on the network. However, the diffs alone are not enough to guarantee that we can execute transactions at the head of the chain, which is why we need the snapshot from block 21M.
We believe that most state data needed to execute transactions at the head of the chain comes from block diffs and not from the snapshot sync, meaning that other nodes will contain most of the useful data.
The capacity of the current nodes on the network is not big enough to support this amount of data efficiently. If other clients want to bootstrap enough infrastructure to hold this data, they can use a similar approach.