Add LLM as Judge evaluation metric #40

NISH1001 · 2024-12-12T21:33:05Z

Major Changes

Add `evalem.nlp.metrics.llm.LLMAsJudgeMetric`, which uses an LLM to evaluate predictions by comparing them against references. It can be installed via the `[llm]` namespace.

Minor Changes

Add evalem._base.structures.SequenceType to encapsulate Union[list, tuple, set]
Add imports to the package `__init__` modules so NLP metrics can be imported more easily.
Upgrade major dependencies
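
As a rough illustration of the `SequenceType` item above, an alias along these lines would cover the three container types. This is a sketch based only on the description in this PR; the actual definition in `evalem._base.structures` may differ, and `to_list` is a hypothetical helper that is not part of evalem.

```python
from typing import Union

# Assumed shape of the alias, taken from the PR description:
# Union[list, tuple, set]. The real definition may differ.
SequenceType = Union[list, tuple, set]


def to_list(items: SequenceType) -> list:
    """Hypothetical helper: normalize any accepted container to a list."""
    if not isinstance(items, (list, tuple, set)):
        raise TypeError(
            f"Expected list, tuple, or set, got {type(items).__name__}"
        )
    return list(items)
```

An alias like this lets metric signatures accept any of the three containers without repeating the `Union` everywhere.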

Usage

import os

from evalem.nlp import LLMAsJudgeMetric

# Judge model and endpoint (here, a local Ollama server).
MODEL = "ollama/llama3.2:3b"
API_BASE = "http://localhost:11434/v1"
PROMPT = "..."  # judge prompt (elided)

references = [...]
predictions = [...]

LLMAsJudgeMetric(
    model=MODEL,
    api_base=API_BASE,
    api_key=os.environ.get("OPENAI_API_KEY"),
    # api_key=None,
    n_tries=3,
    prompt=PROMPT,
    debug=True,
).compute(
    references=references,
    predictions=predictions,
)
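
For readers unfamiliar with the LLM-as-judge pattern, the general idea can be sketched without any model at all. The snippet below is not evalem's implementation: `judge_pairs` and `exact_match_judge` are hypothetical names, the judge is a plain callable standing in for an LLM call, and treating `n_tries` as "retry until the reply parses as a score in [0, 1]" is an assumption about what that parameter means.

```python
from statistics import mean
from typing import Callable, Sequence


def judge_pairs(
    judge: Callable[[str, str], str],
    references: Sequence[str],
    predictions: Sequence[str],
    n_tries: int = 3,
) -> float:
    """Average the judge's score over (reference, prediction) pairs.

    Each pair is retried up to `n_tries` times when the reply cannot be
    parsed as a number in [0, 1]; pairs that never parse count as 0.0.
    """
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    scores = []
    for ref, pred in zip(references, predictions):
        score = 0.0
        for _ in range(n_tries):
            try:
                value = float(judge(ref, pred).strip())
            except ValueError:
                continue  # unparseable reply, try again
            if 0.0 <= value <= 1.0:
                score = value
                break
        scores.append(score)
    return mean(scores) if scores else 0.0


def exact_match_judge(reference: str, prediction: str) -> str:
    """Stub judge (stands in for an LLM): '1' on a case-insensitive match."""
    return "1" if reference.strip().lower() == prediction.strip().lower() else "0"


result = judge_pairs(exact_match_judge, ["Paris", "Berlin"], ["paris", "Rome"])
```

In a real setup the stub would be replaced by a call to the judge model, and the retry loop guards against replies that do not follow the requested output format.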

Add `evalem.nlp.metrics.llm.LLMAsJudgeMetric`

NISH1001 added 5 commits November 27, 2024 12:07

Upgrade requirements

47a4cd2

bump up version

d78772d

Add LLM-as-judge metric

1ac1d3f

Add `evalem.nlp.metrics.llm.LLMAsJudgeMetric`

Update documentation for LLMAsJudgeMetric

b85952a

Add SequenceType type to encapsulate list/tuple/set

1732144

NISH1001 merged commit e3a9c46 into develop Dec 12, 2024
2 of 5 checks passed

NISH1001 deleted the feature/llm-as-judge branch December 12, 2024 21:35
