This repository contains the code for the paper "Auditing Pay-Per-Token in Large Language Models" by Ander Artola Velasco, Stratis Tsirtsis and Manuel Gomez-Rodriguez.
Millions of users rely on a market of cloud-based services to obtain access to state-of-the-art large language models.
However, it has recently been shown that the de facto pay-per-token pricing mechanism used by providers creates a financial incentive for them to strategize and misreport the (number of) tokens a model used to generate an output.
In this paper, we develop an auditing framework based on martingale theory that enables a trusted third-party auditor who sequentially queries a provider to detect token misreporting.
Crucially, we show that our framework is guaranteed to always detect token misreporting, regardless of the provider's (mis-)reporting policy, and, with high probability, not to falsely flag a faithful provider as unfaithful. To validate our auditing framework, we conduct experiments across a wide range of (mis-)reporting policies using several large language models from the Llama, Gemma and Ministral families, and input prompts from a popular crowdsourced benchmarking platform.
The results show that our framework detects an unfaithful provider after observing fewer than
All the experiments were performed using Python 3.11.2. To create a virtual environment and install the project dependencies, run the following commands:
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
├── data
│   └── LMSYS.txt
├── figures
│   ├── audit_faitful
│   ├── audit_heuristic
│   └── audit_random
├── notebooks
│   ├── audit_faitful_random.ipynb
│   ├── audit_heuristic.ipynb
│   └── process_ds.ipynb
├── outputs
│   ├── audit_faithful
│   └── audit_heuristic
├── scripts
│   ├── script_slurm_audit_faithful.sh
│   └── script_slurm_audit_heur.sh
└── src
    ├── audit_faithful.py
    ├── audit_heuristic.py
    └── utils.py
- data: contains the processed set of LMSYS prompts used.
- figures: contains all the figures presented in the paper.
- notebooks: contains Python notebooks to analyze the audit data and generate all the figures included in the paper:
  - audit_faitful_random.ipynb: analyzes the audit data when the provider uses the faithful policy or random policies.
  - audit_heuristic.ipynb: analyzes the audit data when the provider uses the heuristic policies.
  - process_ds.ipynb: builds the LMSYS dataset.
- outputs: contains intermediate output files generated by the experiments' scripts and analyzed in the notebooks. They can be generated using the scripts in the src folder:
  - audit_faithful: contains answers generated in response to the LMSYS prompts, used in the audit for the faithful and random policies.
  - audit_heuristic: contains answers generated in response to the LMSYS prompts, used in the audit for the heuristic policies.
- scripts: contains a set of scripts used to run all the experiments presented in the paper.
- src: contains all the code necessary to reproduce the results in the paper. Specifically:
  - audit_faithful.py: the script used to generate model answers to the LMSYS Chatbot Arena dataset.
  - audit_heuristic.py: the script used to generate model answers to the LMSYS Chatbot Arena dataset and run the heuristic policies on the outputs.
Our experiments use LLMs from the Llama, Gemma and Mistral families, which are "gated" models, that is, they require accepting a license to use. You can request access at https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct, https://huggingface.co/google/gemma-3-1b-it and https://huggingface.co/mistralai/Ministral-8B-Instruct-2410. Once you have access, you can download any model in the Llama, Gemma and Mistral families. Then, before running the scripts, you need to authenticate with your Hugging Face account by running huggingface-cli login in the terminal. Each model should be downloaded to the models/ folder.
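For example, assuming you use the Hugging Face CLI shipped with the huggingface_hub package, the following commands sketch the authentication and download steps; the target subfolder name under models/ is only illustrative and may differ from what the scripts expect:

# Authenticate with your Hugging Face account (requires an access token)
huggingface-cli login
# Download a gated model into the models/ folder (illustrative target path)
huggingface-cli download meta-llama/Llama-3.2-1B-Instruct --local-dir models/Llama-3.2-1B-Instruct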
The script audit_faithful.py generates the output needed to reproduce all experiments for the faithful and random policies. You can run it in your local Python environment or use the Slurm submission script on a cluster, using script_slurm_audit_faithful.sh with your particular machine specifications. You can use the flags --model to set a specific model, such as "L1B" for meta-llama/Llama-3.2-1B-Instruct, --temperature to set the temperature during generation, --prompts to use a list of strings as prompts (it uses by default the LMSYS prompts in data/LMSYS.txt), and --poisson to set the Poisson parameter used in the estimator of tokenization lengths for a string. An example invocation is shown below.
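As an illustration, a typical invocation might look as follows; the flag values below are only examples, not necessarily the defaults used in the paper:

# Generate answers with Llama-3.2-1B-Instruct (example flag values)
python src/audit_faithful.py --model "L1B" --temperature 1.0 --poisson 4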
The script audit_heuristic.py generates the output needed to reproduce all experiments for the heuristic policies. You can run it in your local Python environment or use the Slurm submission script on a cluster, using script_slurm_audit_heur.sh with your particular machine specifications. You can use the flags --model to set a specific model, such as "L1B" for meta-llama/Llama-3.2-1B-Instruct, --temperature to set the temperature during generation, --prompts to use a list of strings as prompts (it uses by default the LMSYS prompts in data/LMSYS.txt), --p for the top-p value used in the heuristic verification step, and --poisson to set the Poisson parameter used in the estimator of tokenization lengths for a string. An example invocation is shown below.
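Similarly, a typical invocation might look as follows; again, the flag values are only examples:

# Generate answers and run the heuristic policies (example flag values)
python src/audit_heuristic.py --model "L1B" --temperature 1.0 --p 0.9 --poisson 4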
To reproduce all the figures, run the notebooks.
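For instance, assuming Jupyter is installed in your environment, you can open one of them with:

# Open a notebook in the browser (any of the notebooks in notebooks/ works)
jupyter notebook notebooks/audit_faitful_random.ipynb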
If you have questions about the code, identify potential bugs, or would like us to include additional functionality, feel free to open an issue or contact Ander Artola Velasco.
If you use parts of the code in this repository for your own research, please consider citing: