Inference script with chat template support for ONNX Runtime GenAI models.
Use this for models that require chat templates (e.g., GPT-OSS-20B) for better output quality.
Based on Microsoft OGA model-chat.py, modified for Ryzen AI.
- Ryzen AI Software installed (see Installation Instructions)
- Activate the conda environment created by the MSI installer:
conda activate ryzen-ai-<version>- For more details on running LLMs with OGA, see OnnxRuntime GenAI (OGA) Flow.
Interactive chat, single-prompt, and multi-turn context cache testing with timing.
This example is based on Microsoft's OGA example.
# Single prompt (fits within 4096 context by default)
python model_chat.py -m "path\to\model" -pr amd_genai_prompt.txt --timings
# Read only the first 256 tokens from the input prompt file
python model_chat.py -m "path\to\model" -pr amd_genai_prompt.txt -l 256
# Long context support (increase context window to e.g. 16k)
python model_chat.py -m "path\to\model" -pr amd_genai_prompt_long.txt -mpt 16000
# Read first 256 tokens from the input prompt file, and increase context window e.g. 16k
python model_chat.py -m "path\to\model" -pr amd_genai_prompt_long.txt -l 256 -mpt 16000
# Multi-turn context cache test (prompts sent sequentially, no rewind)
python model_chat.py -m "path\to\model" -j questions.json
# Interactive chat (use -r to clear KV cache between turns)
python model_chat.py -m "path\to\model"
python model_chat.py -m "path\to\model" -r
# Get all options
python model_chat.py -hThe -j flag accepts a JSON file where prompts are sent sequentially to test context cache reuse:
[
{"id": 1, "prompt": "Can you explain why the sky is blue?"},
{"id": 2, "prompt": "What about sunsets?"},
{"id": 3, "prompt": "Summarize what we discussed."}
]| Flag | Long | Default | Description |
|---|---|---|---|
-m |
--model_path |
— | (required) Path to ONNX model folder |
-pr |
--prompt_file |
— | Single prompt .txt file |
-j |
--json_file |
— | Multi-turn prompts .json file |
-l |
--prompt_length |
— | Read only the first N tokens from the prompt file. |
-mpt |
--max_prompt_tokens |
4096 | Model context window. Sets context cache size. Increase for long context (e.g. -mpt 16000). |
-or |
--output_reserve |
256 | Tokens reserved for generated output. |
-e |
--execution_provider |
follow_config |
cpu, cuda, dml, or follow_config |
-s |
--system_prompt |
You are a helpful AI assistant. |
System prompt |
-r |
--rewind |
off | Clear KV cache and rewind to system prompt between turns (interactive mode) |
-v |
--verbose |
off | Detailed debug output |
-tm |
--timings |
off | Print TTFT, prefill tps, decode tps |
-gtype |
--guidance_type |
none |
Guided generation: json_schema, regex, lark_grammar |
-ginfo |
--guidance_info |
— | Guidance payload |