
llm-flaky


Pytest plugin for running non-deterministic LLM tests.


LLM tests are inherently non-deterministic due to the probabilistic nature of language models. This plugin handles flakiness by automatically retrying tests and requiring an 80% pass rate (4/5 by default).

Features

  • Auto-marking: Automatically applies @pytest.mark.flaky to tests with @pytest.mark.llm
  • 80% accuracy default: Tests pass if 4 out of 5 runs succeed (configurable)
  • Beautiful reports: Replaces standard flaky output with a formatted table
  • Environment variable support: Use FLAKY_MAX_RUNS to control retries
  • pytest-xdist compatible: Works correctly with parallel test execution (example below)
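
For parallel runs, install pytest-xdist alongside this plugin; an example invocation (the -n auto flag comes from pytest-xdist, not from llm-flaky):

pip install pytest-xdist
pytest -n auto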

Installation

pip install llm-flaky

Usage

Mark your LLM tests with @pytest.mark.llm:

import pytest

@pytest.mark.llm
async def test_llm_response():
    # call_llm stands in for your own client; bare async tests also need an
    # async runner plugin such as pytest-asyncio
    response = await call_llm("What is 2+2?")
    assert "4" in response

The plugin automatically applies flaky retry logic. No additional code needed!
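
Auto-marking is equivalent to writing the flaky marker yourself. The sketch below shows what the plugin effectively applies with the default settings (you do not need to add this by hand):

import pytest

@pytest.mark.llm
@pytest.mark.flaky(max_runs=5, min_passes=4)  # added automatically by the plugin
async def test_llm_response_explicit():
    response = await call_llm("What is 2+2?")
    assert "4" in response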

Example output

══════════════════════════════════════════════════════════════════════════════
 LLM TESTS SUMMARY
══════════════════════════════════════════════════════════════════════════════

 Test                                                     Passed       Result
 ────────────────────────────────────────────────────────────────────────────
 test_llm_response_quality                                 4 / 4     ✓ PASSED
 test_llm_context_handling[short]                          4 / 4     ✓ PASSED
 test_llm_context_handling[long]                           3 / 4     ✓ PASSED

 ✗ FAILED TESTS:
 ────────────────────────────────────────────────────────────────────────────
 test_llm_edge_case                                        2 / 4     ✗ FAILED
 ────────────────────────────────────────────────────────────────────────────
 ⚠ Total                                                   3 / 4       75.0%
══════════════════════════════════════════════════════════════════════════════

Configuration

Environment variables

FLAKY_MAX_RUNS=3 pytest  # Run each test up to 3 times (min_passes=2)

Command line options

pytest --llm-flaky-max-runs=5           # Max runs for LLM tests (default: 5)
pytest --llm-flaky-min-passes=4         # Min passes required (default: max_runs - 1)
pytest --llm-flaky-exclude-marker=skip  # Marker to exclude from flaky
pytest --llm-flaky-title="My Title"     # Custom report title
pytest --no-llm-flaky-report            # Disable beautiful report

pytest.ini options

[pytest]
llm_flaky_max_runs = 5
llm_flaky_min_passes = 4
llm_flaky_exclude_marker = langsmith_dataset
llm_flaky_title = LLM TESTS SUMMARY
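
If you configure pytest through pyproject.toml instead of pytest.ini, the same options go under [tool.pytest.ini_options] (equivalent sketch; values quoted as strings):

[tool.pytest.ini_options]
llm_flaky_max_runs = "5"
llm_flaky_min_passes = "4"
llm_flaky_exclude_marker = "langsmith_dataset"
llm_flaky_title = "LLM TESTS SUMMARY"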

Priority

Configuration is read in this order (highest priority first):

  1. FLAKY_MAX_RUNS environment variable
  2. Command line options (--llm-flaky-*)
  3. pytest.ini options (llm_flaky_*)
  4. Defaults (max_runs=5, min_passes=4)
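
For illustration only (a sketch of the precedence, not the plugin's actual code), the effective max_runs is resolved roughly like this:

import os

def resolve_max_runs(cli_value=None, ini_value=None):
    env_value = os.environ.get("FLAKY_MAX_RUNS")
    if env_value:                # 1. environment variable wins
        return int(env_value)
    if cli_value is not None:    # 2. --llm-flaky-max-runs
        return int(cli_value)
    if ini_value is not None:    # 3. llm_flaky_max_runs from pytest.ini
        return int(ini_value)
    return 5                     # 4. built-in default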

How it works

  1. Collection phase: Plugin finds all tests with @pytest.mark.llm
  2. Auto-marking: Applies @pytest.mark.flaky(max_runs=5, min_passes=4)
  3. Execution: the flaky plugin handles the retry logic
  4. Reporting: Beautiful summary table replaces standard output
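
The auto-marking step boils down to something like the following conftest-style sketch (assumed names, not the plugin's actual implementation):

import pytest

def pytest_collection_modifyitems(config, items):
    for item in items:
        # wrap every @pytest.mark.llm test in flaky retries,
        # skipping tests that carry the excluded marker
        if item.get_closest_marker("llm") and not item.get_closest_marker("langsmith_dataset"):
            item.add_marker(pytest.mark.flaky(max_runs=5, min_passes=4))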

Excluding tests

Tests with the @langsmith_dataset marker are excluded by default (they use LangSmith's built-in evaluation):

@pytest.mark.llm
@langsmith_dataset("my_dataset.yaml")
async def test_with_langsmith():
    # This test won't get flaky retry - LangSmith handles evaluation
    pass

Requirements

  • Python >= 3.9
  • pytest >= 7.0.0
  • flaky >= 3.7.0

License

MIT
