Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: refactor ceramic simulations and new benchmark for V' features #149

Merged
merged 10 commits into from
Feb 21, 2024

Conversation

dav1do
Copy link
Contributor

@dav1do dav1do commented Feb 14, 2024

Linear ws1-1438

Summary

The main goal was to add a new simulation to benchmark Model Instance Document creation for the E2E ceramic V' (recon) system. I ended up making a bunch of changes along the way and can potentially split this into multiple PRs if it's too hard to review.

I created a new scenario called ceramic-new-streams-benchmark that creates a model, all nodes index the model, and then 1 node creates new MIDs. We investigate how fast these sync to the other nodes in order to determine if we succeed. As there are no metrics for this available afaik, and the workers don't currently have access to prometheus, I ended up using redis to store before/after data on the workers so the manager can interrogate the results at the end. It respects the throttleRequests and successRequestTarget values.

Details

  • I had to expose the ceramic node admin DID's private key to the simulation in order to interact with the endpoints (i.e. index models). I added a did_from_pk function for this (similar to composedb), but it should probably be moved to rust-ceramic.

  • I corrected a handful of bugs: indexing was not succeeding, and some of the simulations did not seem to do exactly what they were designed to (e.g. running something per node instead of per simulation). I adjusted the global/node/user goose leader functions to support the various control flows, and corrected the requests to log their statuses to see a better summary of the simulation and request breakdown. This does change how the output looks, for example the ceramic simple scenario now shows more request info, even though the behavior should be unchanged.

  • There is now a simulation configuration object that allows setting up the model instance specific details. This includes things like how signing should be done (user specific DID key, DID per node), whether models should be shared across nodes or not. The setup makes sure this happens appropriately, and stores the goose user data/config so that the simulations can proceed without really caring what they're using (they just use the ceramic http client and the models/MIDs in the goose data struct).

    • I refactored the Ceramic simulations to use this config. They are almost all the same with a slight variation and I wanted to take this further, but already got too into the weeds. We could almost have one file to define all the operations (get/update/etc) and parameterized config to allow changing the behavior, which would allow for more interesting simulations without requiring code changes. This would entail exposing some of the config options to the simulation in order to get passed down. Anyway, for now, having a bunch of simulation names is fine.
    • The reason for supporting user DIDs is that the js-ceramic worker threads changes parallelize on the DID, so we wanted a more realistic scenario where not every user in the system was the same.
  • I noticed that the local-scripts didn't work right and fixed them (they relied on local config I had).

  • The new postgres database ignored the devMode flag when setting its resources, now it doesn't.

Results

Results from a 1 minute run on my MBP just to demonstrate that it works:

Worker 1 did not meet the target request rate: 228.16666666666666 < 300 (total requests: 13690 over 60)
Worker 2 did not meet the target request rate: 160.91666666666666 < 300 (total requests: 9655 over 60)

Follow up

When generating random data for models, I ran into an error like the following that crashed js-ceramic. I'm not yet sure what/where the bug is (that is, it might be postgres config/not setting the encoding to utf8), it might be js-ceramic, or something else. I switched to strings as a result but need to follow up.

[2024-02-18T17:27:14.774Z] IMPORTANT: Ceramic API running on 0.0.0.0:7007'
/js-ceramic/node_modules/rxjs/dist/cjs/internal/util/reportUnhandledError.js:13
            throw err;
            ^

error: insert into "kjzl6hvfrbw6ca1eadhvu4l06ssws62ks7gmgzc8285p2xmqaaiw0l57paapfnp" ("controller_did", "created_at", "first_anchored_at", "last_anchored_at", "stream_content", "stream_id", "tip", "updated_at") values ($1, $2, $3, $4, $5, $6, $7, $8) on conflict ("stream_id") do update set "stream_id" = $9,"controller_did" = $10,"stream_content" = $11,"tip" = $12,"last_anchored_at" = $13,"first_anchored_at" = $14,"updated_at" = $15 - unsupported Unicode escape sequence
    at Parser.parseErrorMessage (/js-ceramic/node_modules/pg-protocol/dist/parser.js:287:98)
    at Parser.handlePacket (/js-ceramic/node_modules/pg-protocol/dist/parser.js:126:29)
    at Parser.parse (/js-ceramic/node_modules/pg-protocol/dist/parser.js:39:38)
    at Socket.<anonymous> (/js-ceramic/node_modules/pg-protocol/dist/index.js:11:42)
    at Socket.emit (node:events:518:28)
    at Socket.emit (node:domain:488:12)
    at addChunk (node:internal/streams/readable:559:12)
    at readableAddChunkPushByteMode (node:internal/streams/readable:510:3)
    at Readable.push (node:internal/streams/readable:390:5)
    at TCP.onStreamRead (node:internal/stream_base_commons:190:23) {
  length: 250,
  severity: 'ERROR',
  code: '22P05',
  detail: '\\u0000 cannot be converted to text.',
  hint: undefined,
  position: undefined,
  internalPosition: undefined,
  internalQuery: undefined,
  where: 'JSON data, line 1: ...葎󲼯􊋁򃻐򖕟𭅨򩗒𺽣񎾏򻔻񟢽\\u0000...\n' +
    "unnamed portal parameter $5 = '...'",
  schema: undefined,
  table: undefined,
  column: undefined,
  dataType: undefined,
  constraint: undefined,
  file: 'jsonfuncs.c',
  line: '621',
  routine: 'json_ereport_error'
}

Copy link

linear bot commented Feb 14, 2024

@dav1do dav1do force-pushed the feat/ws1-1438-mid-simulation branch from 2345de7 to 61544b6 Compare February 16, 2024 19:53
dav1do and others added 7 commits February 16, 2024 15:00
Most of the scenarios are very simlilar. Add a more config based approach, allowing changing whether models and keys are shared or unique. Added naming of requests to get more granular goose metrics.
Optionally share model instance docs across simulation
@dav1do dav1do force-pushed the feat/ws1-1438-mid-simulation branch 2 times, most recently from 3d4c112 to 4c065a3 Compare February 19, 2024 23:41
Includes some refactoring (sorry) and collecting metrics for success.
This scenario doesn't have a prom metric as easily accessible afaik, so I used redis to have the worker capture the values before/after.
@dav1do dav1do force-pushed the feat/ws1-1438-mid-simulation branch from 4c065a3 to a6976f6 Compare February 20, 2024 16:53
@dav1do dav1do marked this pull request as ready for review February 20, 2024 17:29
Copy link
Collaborator

@nathanielc nathanielc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks for taking on this refactor along with the end goal.

As there are no metrics for this available afaik, and the workers don't currently have access to prometheus, I ended up using redis to store before/after data on the workers so the manager can interrogate the results at the end.

This seems fine, using redis makes it quick and easy. My only question is should we invest the time now to add the metrics to the js-ceramic process and expose it to the goose workers? Is that the best long term strategy? Not that we will block this PR on that work but rather should we capture this need as issue and start on it right away?

@dav1do
Copy link
Contributor Author

dav1do commented Feb 20, 2024

This seems fine, using redis makes it quick and easy. My only question is should we invest the time now to add the metrics to the js-ceramic process and expose it to the goose workers? Is that the best long term strategy? Not that we will block this PR on that work but rather should we capture this need as issue and start on it right away?

I think we should definitely have some metrics in js-ceramic that make it easy for someone to tell what's going on with their node but I'm not sure what we currently have. I'm not sure how urgent/important exposing them to the simulations is but it seems like it could be quite useful, especially as we use keramik and the simulations to identify and investigate issues, and add more correctness validation to them.

Right now, correctness checks require doing scenario specific things, but with the shared setup, we could also have some "shared correctness" checks at the end if metrics are exposed. While some validations are going to be scenario specific, having some sanity validations and/or re-publishing specific metrics to the goose output, could simplify review and understanding things. I think having a dashboard of "key metrics" for simulations would be incredibly helpful toward a holistic understanding of the network health and the simulation status/success (especially as we try to include or exclude CAS and other services in the benchmarks). Some of this might be possible already and just require providing an integrated dashboard/steps/etc in order to make it happen.

sometimes hit errors like `failed to get model count: TransactionError: request canceled because throttled load test ended (sending on a closed channel)` so we use our own reqwest client instead of the goose supplied one on the user. this only happened if the `throttleRequests` value is set
@dav1do dav1do added this pull request to the merge queue Feb 21, 2024
Merged via the queue into main with commit ff53def Feb 21, 2024
5 checks passed
@dav1do dav1do deleted the feat/ws1-1438-mid-simulation branch February 21, 2024 19:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants