start_tokens in synthetic prompt is not related to prompt_text #360

@tukwila

Describe the bug

When I run benchmark tests with synthetic prompts, the generated prompts begin with tokens in a different language from the rest of the prompt. For example:

In this screenshot, the start_tokens are in various other languages and are followed by English text.
Image

In this screenshot, the start_tokens are in a different language from the German text that follows them.
Image

Prompts like these are unsuitable for benchmark testing.

I wrote a script to dig into the problem:

import json

from guidellm.dataset.synthetic import SyntheticDatasetCreator


def test_handle_create_basic():
    # Example: German source text
    # data = "prompt_tokens=128,output_tokens=56, samples=10, source=https://www.gutenberg.org/cache/epub/30793/pg30793.txt"
    # Example: English source text
    data = "prompt_tokens=128,output_tokens=56, samples=10, source=https://www.gutenberg.org/cache/epub/1342/pg1342.txt"
    synthetic_creator = SyntheticDatasetCreator()
    result = synthetic_creator.handle_create(
        data=data,
        data_args=None,
        processor="${local_path}/Qwen2.5-1.5B-Instruct",
        processor_args=None,
        random_seed=42,
    )
    # Collect the generated prompt strings from the "prompt" column.
    res_list = result[:]["prompt"]
    # Dump the prompts without ASCII escaping so non-English start tokens stay readable.
    with open("./synthetic_prompts.json", "w", encoding="utf-8") as f:
        json.dump(res_list, f, ensure_ascii=False, indent=4)


if __name__ == "__main__":
    test_handle_create_basic()

Expected behavior

The start tokens should be in the same language as the rest of the prompt, i.e. they should come from the same source text.

Environment
Include all relevant environment information:

  1. OS [e.g. Ubuntu 20.04]:
  2. Python version: 3.9.9
  3. guidellm version: 0.3.0

To Reproduce
Exact steps to reproduce the behavior:
Step 1. Save the script above in your local environment as test_synthetic_prompt.py.
Step 2. Run python ./test_synthetic_prompt.py
