feat: refactor ceramic simulations and new benchmark for V' features #149
Conversation
Force-pushed from 2345de7 to 61544b6
Most of the scenarios are very similar. Add a more config-based approach, allowing changing whether models and keys are shared or unique. Added naming of requests to get more granular goose metrics.
Optionally share model instance docs across simulation
Force-pushed from 3d4c112 to 4c065a3
Includes some refactoring (sorry) and collecting metrics for success. This scenario doesn't have an easily accessible Prometheus metric as far as I know, so I used redis to have the worker capture the values before/after.
Force-pushed from 4c065a3 to a6976f6
Looks great, thanks for taking on this refactor along with the end goal.
As there are no metrics for this available afaik, and the workers don't currently have access to prometheus, I ended up using redis to store before/after data on the workers so the manager can interrogate the results at the end.
This seems fine; using redis makes it quick and easy. My only question is: should we invest the time now to add the metrics to the js-ceramic process and expose them to the goose workers? Is that the best long-term strategy? Not that we would block this PR on that work, but should we capture this need as an issue and start on it right away?
I think we should definitely have some metrics in js-ceramic that make it easy for someone to tell what's going on with their node, but I'm not sure what we currently have. I'm also not sure how urgent or important exposing them to the simulations is, but it seems like it could be quite useful, especially as we use keramik and the simulations to identify and investigate issues and add more correctness validation to them.

Right now, correctness checks require doing scenario-specific things, but with the shared setup we could also have some "shared correctness" checks at the end if metrics are exposed. While some validations are going to be scenario specific, having some sanity validations, and/or re-publishing specific metrics to the goose output, could simplify review and understanding.

I think having a dashboard of "key metrics" for simulations would be incredibly helpful toward a holistic understanding of network health and simulation status/success (especially as we try to include or exclude CAS and other services in the benchmarks). Some of this might be possible already and just require providing an integrated dashboard/steps/etc to make it happen.
We sometimes hit errors like `failed to get model count: TransactionError: request canceled because throttled load test ended (sending on a closed channel)`, so we use our own reqwest client instead of the goose-supplied one on the user. This only happened if the `throttleRequests` value is set.
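A minimal sketch of that workaround, assuming a goose transaction and a hypothetical endpoint path (the actual scenario code differs): the call goes through a dedicated `reqwest::Client` rather than the user's throttled client, so shutdown of the throttle channel can't cancel it.

```rust
// Sketch only: issue a request with a dedicated reqwest client instead of the
// goose-managed client, so it is not cancelled when the throttled load test ends.
use goose::prelude::*;
use once_cell::sync::Lazy;

// One shared client for all goose users; it bypasses goose's request throttle.
static RAW_CLIENT: Lazy<reqwest::Client> = Lazy::new(reqwest::Client::new);

async fn get_model_count(user: &mut GooseUser) -> TransactionResult {
    // Hypothetical path; resolve it against the user's base URL.
    let url = user.build_url("/api/v0/models/count")?;
    let resp = RAW_CLIENT
        .get(&url)
        .send()
        .await
        .map_err(|e| Box::new(TransactionError::Reqwest(e)))?;
    // The sketch only checks the call completed; the real scenario parses the count.
    let _body = resp
        .text()
        .await
        .map_err(|e| Box::new(TransactionError::Reqwest(e)))?;
    Ok(())
}
```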
Linear ws1-1438
Summary
The main goal was to add a new simulation to benchmark Model Instance Document creation for the E2E ceramic V' (recon) system. I ended up making a bunch of changes along the way and can potentially split this into multiple PRs if it's too hard to review.
I created a new scenario called `ceramic-new-streams-benchmark` that creates a model, has all nodes index the model, and then has one node create new MIDs. We investigate how fast these sync to the other nodes in order to determine whether we succeed. As there are no metrics for this available afaik, and the workers don't currently have access to prometheus, I ended up using redis to store before/after data on the workers so the manager can interrogate the results at the end. It respects the `throttleRequests` and `successRequestTarget` values.
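As a rough illustration of that redis hand-off (the key names and connection details are invented, not the PR's actual ones): each worker records the MID count it sees before and after its run, and the manager reads the pairs back to judge how many documents synced.

```rust
// Sketch, assuming the plain `redis` crate and hypothetical key names: workers write
// before/after counts, and the manager reads them once the simulation finishes.
use redis::Commands;

fn record_counts(redis_url: &str, node_id: &str, before: u64, after: u64) -> redis::RedisResult<()> {
    let mut con = redis::Client::open(redis_url)?.get_connection()?;
    // One before/after pair per node.
    con.set::<_, _, ()>(format!("mid_count:{node_id}:before"), before)?;
    con.set::<_, _, ()>(format!("mid_count:{node_id}:after"), after)?;
    Ok(())
}

fn new_streams_seen(redis_url: &str, node_id: &str) -> redis::RedisResult<u64> {
    let mut con = redis::Client::open(redis_url)?.get_connection()?;
    let before: u64 = con.get(format!("mid_count:{node_id}:before"))?;
    let after: u64 = con.get(format!("mid_count:{node_id}:after"))?;
    Ok(after.saturating_sub(before))
}
```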
Details
I had to expose the ceramic node admin DID's private key to the simulation in order to interact with the endpoints (i.e. index models). I added a `did_from_pk` function for this (similar to composedb), but it should probably be moved to rust-ceramic.
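For context, deriving a `did:key` identifier from a raw Ed25519 private key looks roughly like the following. This is a generic sketch using the `ed25519-dalek` and `multibase` crates, not the actual `did_from_pk` added in this PR.

```rust
// Generic sketch of did:key derivation from a raw Ed25519 private key;
// not the actual did_from_pk implementation in this PR.
use ed25519_dalek::SigningKey;
use multibase::Base;

fn did_from_private_key(private_key: &[u8; 32]) -> String {
    let public_key = SigningKey::from_bytes(private_key).verifying_key().to_bytes();
    // did:key encodes the multicodec ed25519-pub prefix (0xed 0x01) plus the
    // 32-byte public key as base58btc multibase (hence the leading 'z').
    let mut bytes = vec![0xed, 0x01];
    bytes.extend_from_slice(&public_key);
    format!("did:key:{}", multibase::encode(Base::Base58Btc, bytes))
}
```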
I corrected a handful of bugs: indexing was not succeeding, and some of the simulations did not seem to do exactly what they were designed to do (e.g. running something per node instead of per simulation). I adjusted the global/node/user goose leader functions to support the various control flows, and corrected the requests to log their statuses so we get a better summary of the simulation and request breakdown. This does change how the output looks; for example, the ceramic simple scenario now shows more request info, even though its behavior should be unchanged.
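To illustrate the leader gating idea (the helper below is hypothetical, not keramik's actual leader functions): exactly one goose user performs the per-node or per-simulation setup, and the rest skip it.

```rust
// Hypothetical sketch of leader-gated setup (not keramik's actual helpers):
// whichever goose user flips the flag first becomes the "node leader" and runs
// the per-node setup exactly once; every other user skips it.
use goose::prelude::*;
use std::sync::atomic::{AtomicBool, Ordering};

static NODE_SETUP_STARTED: AtomicBool = AtomicBool::new(false);

async fn setup(_user: &mut GooseUser) -> TransactionResult {
    let is_node_leader = !NODE_SETUP_STARTED.swap(true, Ordering::SeqCst);
    if is_node_leader {
        // Per-node, once-only work would go here, e.g. asking this node to
        // index the shared model before the benchmark transactions start.
    }
    Ok(())
}
```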
There is now a simulation configuration object that allows setting up the model-instance-specific details. This includes things like how signing should be done (a user-specific DID key, or a DID per node) and whether models should be shared across nodes or not. The setup makes sure this happens appropriately and stores the goose user data/config so that the simulations can proceed without really caring what they're using (they just use the ceramic http client and the models/MIDs in the goose data struct).
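A rough sketch of what such a configuration object might look like; the type and field names here are invented for illustration and are not the PR's actual types.

```rust
// Invented for illustration; the PR's actual configuration type differs.
#[derive(Clone, Debug)]
pub enum SignerKind {
    /// Every goose user signs with its own DID key.
    PerUser,
    /// All users on a node share that node's DID key.
    PerNode,
}

#[derive(Clone, Debug)]
pub struct ScenarioConfig {
    /// How MIDs are signed in this scenario.
    pub signer: SignerKind,
    /// Whether all nodes write against one shared model or each node gets its own.
    pub shared_model: bool,
    /// Label used to name requests so goose reports per-request metrics.
    pub request_name: &'static str,
}

impl ScenarioConfig {
    pub fn new_streams_benchmark() -> Self {
        Self {
            signer: SignerKind::PerNode,
            shared_model: true,
            request_name: "create_mid",
        }
    }
}
```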
I noticed that the local-scripts didn't work right and fixed them (they relied on local config I had).
The new postgres database ignored the `devMode` flag when setting its resources; now it doesn't.

Results
Results from a 1 minute run on my MBP just to demonstrate that it works:
Follow up
When generating random data for models, I ran into an error like the following that crashed js-ceramic. I'm not yet sure what or where the bug is: it might be the postgres config (not setting the encoding to utf8), it might be js-ceramic, or it might be something else. I switched to strings as a result but need to follow up.