Observed on an interrupted stream with Anthropic models: usage comes back strangely low (e.g., 26 input tokens) on a very long chat. Perhaps the cache is getting thrashed and we're somehow not counting the cache-related tokens?
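If the low number is just `input_tokens` without the cache fields, the fix might be as simple as the sketch below. This is an assumption about where our accounting drops them; the field names are the ones Anthropic's Messages API usage object reports for prompt caching.

```python
# Sketch (assumption): Anthropic reports prompt-cache activity in separate usage
# fields, so "input_tokens" alone can look tiny on a long cached chat.
def total_input_tokens(usage: dict) -> int:
    return (
        (usage.get("input_tokens") or 0)
        + (usage.get("cache_creation_input_tokens") or 0)
        + (usage.get("cache_read_input_tokens") or 0)
    )
```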
Observed on GPT-5-Pro: I believe we're estimating reasoning cost by counting tokens in the reasoning trace. That might be roughly accurate for Anthropic models, where the trace is highly detailed, but it falls flat on its face with GPT-5-Pro and the costs come out an order of magnitude off. We should check whether the API reports the number of reasoning tokens by some other means.
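OpenAI's usage payload does appear to expose an explicit reasoning-token count; a hedged sketch of reading it, assuming the `completion_tokens_details` / `output_tokens_details` shapes from the Chat Completions and Responses APIs (not verified against GPT-5-Pro specifically):

```python
# Sketch (assumption): prefer an explicit reasoning-token count from the
# provider's usage payload before falling back to estimating from the trace.
def reported_reasoning_tokens(usage: dict) -> int | None:
    details = (
        usage.get("completion_tokens_details")  # Chat Completions style
        or usage.get("output_tokens_details")   # Responses API style
        or {}
    )
    tokens = details.get("reasoning_tokens")
    return int(tokens) if tokens is not None else None
```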
One idea is to compute reasoning_tokens = total_output_tokens - parsed_text_tokens. That might be provider-agnostic.
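A minimal sketch of that fallback; the `count_tokens` helper is illustrative (whatever tokenizer-backed counter we already use for text), not a reference to existing code:

```python
# Sketch (assumption): when no explicit reasoning count is reported, estimate it
# as total output tokens minus the tokens in the parsed (visible) text.
def estimated_reasoning_tokens(total_output_tokens: int, parsed_text: str, count_tokens) -> int:
    # Clamp at zero in case the tokenizer we count with differs from the provider's.
    return max(total_output_tokens - count_tokens(parsed_text), 0)
```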