Evals section update #515

@nearestnabors

I've reviewed:

Suggested edits for both

  • Use the "Outcomes/YWL/Prereqs" header from the Build Tools content
  • Match the page title to the navigation item. Currently they do not match.

Edits for Evaluate Tools

  • Step 2 tells the reader to navigate to my_server, but anyone who created an MCP server as per the Prerequisites is already in that folder. Rephrase as, "In your server's root folder, create a new Python file..."
  • Step 4 "Run the evaluation" has:
export OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>
arcade evals .

Split these into two. The evals call should stand alone.

Reference your recent quickstart for how to handle environment variables. Possibly call this out in a warning (hey, don't forget to set your env variables!).

  • Remove the bit about the different providers. Currently it ONLY works with OpenAI.
  • Move "How it works" and "Next Steps" outside the steps.
  • "Critic Classes" should be moved to a reference page. Consider consolidating with their explanations in Why Evaluations?.
  • Advanced evaluation cases could also be moved to its own page. Remember, the outcome of this page was to evaluate.
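
For the Step 4 suggestion above, here is a rough sketch of what a "don't forget your env variables" warning could look like if it were expressed as a pre-flight check. The function name check_openai_key is hypothetical and not part of the Arcade CLI; only the OPENAI_API_KEY variable and the arcade evals . command come from the existing page.

```shell
# Hypothetical helper (not part of the Arcade CLI): warn when the key is missing.
check_openai_key() {
  # ${VAR:-} keeps the check safe even when the shell runs with `set -u`.
  if [ -z "${OPENAI_API_KEY:-}" ]; then
    echo "OPENAI_API_KEY is not set. Export it before running evals." >&2
    return 1
  fi
  return 0
}

# With the key exported in a separate step, the evals call can stand alone:
#   check_openai_key && arcade evals .
```

This keeps setting the variable and running the evaluation as two distinct actions, which is the split requested for Step 4. A plain warning callout in the docs would serve the same purpose without any script.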

Run Evaluations/Run evaluations with the Arcade CLI

  • Overall, this page is both a guide (how evals work), where the former is a tutorial, and a reference (all the options). It reads like a non-tutorial version of the previous page, which makes it a little repetitive. I would lean into making this a comprehensive guide for arcade evals and move the advanced content from Evaluate Tools into it. You might split it into a guide and a reference to DRY up and shorten the pages (folks looking for a command reference are not looking for a tutorial).

  • The section on Handling multiple models needs to be removed, because arcade evals currently only supports OpenAI. Alternatively, you could just point this out and say "more coming soon!"
