Unify llm judges into a single prepare file #1696

Open · wants to merge 2 commits into main

Conversation

martinscooper (Collaborator)

This PR moves the judges in prepare/metrics/llm_as_judge/direct/llama_3_3_70b_instruct_adherence_completeness.py to prepare/metrics/llm_as_judge/llm_as_judge.py so that:

  • judges and the underlying inference engine are created using the same inference engine/judge parameter set (for example, temperature = 0)
  • all new LLM judge approaches are created in the same file (it is a bit cumbersome to have to run multiple files when changing the artifacts)
  • it also moves the criteria definitions adherence_with_format and answer_completeness to llm_as_judge_constants.py
  • it uses the criteria's catalog name instead of the object (see the sketch below)
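
For illustration, a minimal sketch of that last point, assuming unitxt's bracket-override syntax for catalog artifacts (the judge and criteria catalog names here are indicative, not necessarily the exact ones this PR registers):

from unitxt.artifact import fetch_artifact

# The judge and its criteria are fully specified by catalog names;
# no Criteria object is constructed by hand.
judge_metric = (
    "metrics.llm_as_judge.direct.rits.llama3_3_70b"
    "[criteria=metrics.llm_as_judge.direct.criteria.adherence_with_format,"
    "context_fields=[question,instruction]]"
)
judge, _ = fetch_artifact(judge_metric)  # resolves the judge with the overrides applied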

@lilacheden the default context fields of the adherence metric, in particular its instructions entry, seem a bit too specific:

"context_fields": {
    "question": "question",
    "instructions": "metadata/template/instruction",
}

Do you think we could simplify it?

@elronbandel I tried setting those context fields values using the square bracket notation, but it says the value is malformed. Could you remind me whether dictionaries are supported there?

lilacheden (Member)


Hi @martinscooper ,

  1. What do you mean by "too specific"? The judge requires the instructions of the original prompt, and this is where they can be found (at least for the relevant task). I don't know of any general way to get them.

  2. It makes sense to always create the judges from the catalog. From now on I will always use the registered LLM judge and just override the criteria, context fields, or any other desired attributes instead of creating a new judge (see the sketch below).

  3. However, I'm not sure all LLM judges and criteria should be prepared and stored together - maybe it's better to have a public catalog for everyone with the suggested criteria, and a private catalog (with separate preparation scripts) where each user (like myself) can create their own esoteric criteria and judges, just as users can create them on the fly. It could help, for example, if someone wants to use a criterion similar to one in the public catalog but describe it differently for their own use case.

How does that sound to you?
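
A minimal sketch of the override pattern from point 2, assuming the judge is fetched from the catalog and its attributes are then overridden in place (catalog names and the reference_answer field are indicative, not the registered ones):

from unitxt.artifact import fetch_artifact

# Fetch the registered judge instead of constructing a new one...
judge, _ = fetch_artifact("metrics.llm_as_judge.direct.rits.llama3_3_70b")

# ...then override only what differs for this use case.
criteria, _ = fetch_artifact("metrics.llm_as_judge.direct.criteria.answer_completeness")
judge.criteria = criteria
judge.context_fields = ["question", "reference_answer"]  # hypothetical context fields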

martinscooper (Collaborator, Author)


@lilacheden

  1. By "too specific" I mean that the instructions entry of the context fields should have a simpler field name by default. I would define context_fields simply as a list, which users probably wouldn't have to change:
{
  ...,
  "context_fields":  ["question", "instruction"],
  ...,
}

Then, if a user needs a more specific source for a context field, they can specify the mapping manually for their use case, as in the dict form sketched below.
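
A sketch of the two forms side by side, assuming LLMJudgeDirect accepts context_fields as either a list or a dict, with an indicative inference engine configuration (not the exact constructor call from the prepare scripts):

from unitxt.inference import CrossProviderInferenceEngine
from unitxt.llm_as_judge import LLMJudgeDirect

engine = CrossProviderInferenceEngine(
    model="llama-3-3-70b-instruct",  # indicative model name
    temperature=0,  # the shared judge parameter discussed above
)

# Simple default: each entry names both the context field and its source.
judge = LLMJudgeDirect(
    inference_engine=engine,
    context_fields=["question", "instruction"],
)

# Specific override: map each context field name to a nested source path.
judge_specific = LLMJudgeDirect(
    inference_engine=engine,
    context_fields={
        "question": "question",
        "instructions": "metadata/template/instruction",
    },
)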

  2. Agree.

  3. Sounds good. It is true that it is not that important to have all registered judges in the same file, so I would move the judges back to their original file. Do you agree on calling get_evaluator_metadata() so that the params are consistent across all the judges?


from unitxt import evaluate, load_dataset
from unitxt.blocks import Task, TaskCard
from unitxt.llm_as_judge import CreateYesNoCriteriaFromString
from unitxt.loaders import LoadFromDictionary

data = {
    "test": [
        {
            "question": "How is the weather?",

Member:

Why was this example changed? Is this intentional?

martinscooper (Collaborator, Author):

No, I will remove it.

Signed-off-by: Martín Santillán Cooper <[email protected]>

martinscooper (Collaborator, Author)

@yoavkatz @elronbandel I applied the fix.
