This is the official repository of the paper Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale and the PersonaMem benchmark.
We present PersonaMem, a new personalization benchmark to assess how well language models can infer evolving user profiles and generate personalized responses across task scenarios. PersonaMem emphasizes persona-oriented, multi-session interactions between users and chatbots, facilitated by a synthetic dialog generation pipeline that simulates realistic and evolving conversational contexts.
Different users have different personas. Personalization in LLMs involves adapting model responses to individual users based on their traits, preferences, and interaction history. By analyzing previous interactions, LLMs learn to deliver more relevant and tailored responses to different users, rather than merely providing generic correct answers. As a result, personalization enhances the model's effectiveness in tasks such as writing assistance, recommendation, and consultation, thereby improving user experience and engagement.
We investigate three research questions in LLM personalization:
- How well can LLMs internalize the user's inherent traits and preferences?
- Can LLMs track how user profiling and preferences evolve over time?
- Are LLMs able to generate personalized responses accordingly in new scenarios?
As shown in the overview, each benchmark sample is a user persona with static (e.g., demographic info.) and dynamic attributes (e.g., evolving preferences). Users engage with a chatbot in multi-session interactions across a variety of topics such as food recommendation, travel planning, and therapy consultation. As the user’s preferences evolve over time, the benchmark offers annotated questions assessing whether models can track and incorporate the changes into their responses.
If you find our work inspires you, please consider citing it. Thank you!
@article{jiang2025know,
title={Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale},
author={Jiang, Bowen and Hao, Zhuoqun and Cho, Young-Min and Li, Bryan and Yuan, Yuan and Chen, Sihao and Ungar, Lyle and Taylor, Camillo J and Roth, Dan},
journal={arXiv preprint arXiv:2504.14225},
year={2025}
}
We release the benchmark data of PersonaMem on Google Drive and 🤗 Hugging Face, including question-answer pairs, the corresponding contexts, and other metadata. The dataset is available in three versions based on context token length:
- 32k tokens: `questions_32k.csv`, `shared_contexts_32k.jsonl`
- 128k tokens: `questions_128k.csv`, `shared_contexts_128k.jsonl`
- 1M tokens: `questions_1M.csv`, `shared_contexts_1M.jsonl`
Each `questions_[SIZE].csv` file contains the following columns:
- `persona_id`: Unique ID for each user persona
- `question_id`: Unique ID for each question
- `question_type`: Type of the question
- `topic`: Topic of the conversation session
- `context_length_in_tokens`: Total tokens in the context
- `context_length_in_letters`: Total English letters in the context
- `distance_to_ref_in_blocks`: Blocks from the question to the most recent preference mention
- `distance_to_ref_in_tokens`: Tokens from the question to the most recent preference mention
- `num_irrelevant_tokens`: Tokens from irrelevant interactions
- `distance_to_ref_proportion_in_context`: Proportional position of the latest preference mention in the context
- `user_question_or_message`: The user's question or message posed to the model
- `correct_answer`: The correct answer choice
- `all_options`: List of all answer choices presented for this question
- `shared_context_id`: Key to retrieve the full context from `shared_contexts_[SIZE].jsonl`
- `end_index_in_shared_context`: Use to slice the loaded context as `context[:int(end_index_in_shared_context)]`
Each `shared_contexts_[SIZE].jsonl` file is a JSONL-formatted list of API-style message dicts representing user–model interaction sequences (see the loading sketch below).
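As a quick-start illustration, here is a minimal sketch of how the released files could be loaded and aligned. It uses `pandas` for the CSV and treats `shared_context_id` as a positional index into the JSONL list; both are assumptions on our part, so adjust as needed if your copy of the data differs.

```python
import json
import pandas as pd

# Load the question-answer pairs and the shared interaction contexts (128k split as an example).
questions = pd.read_csv("questions_128k.csv")
with open("shared_contexts_128k.jsonl", "r") as f:
    shared_contexts = [json.loads(line) for line in f]

# Look up the interaction history available for the first question.
row = questions.iloc[0]
full_context = shared_contexts[int(row["shared_context_id"])]  # assumption: ID is a positional index
# Slice the context up to the point where the question is asked, as documented above.
visible_context = full_context[: int(row["end_index_in_shared_context"])]

print(len(visible_context), "messages in context")
print("Question:", row["user_question_or_message"])
print("Gold answer:", row["correct_answer"])
```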
🚨 We evaluate 15 state-of-the-art LLMs, including GPT-4.5, GPT-4.1, o4-mini, o3-mini, o1, Llama-4, DeepSeek-R1, Gemini-2, Gemini-1.5, Claude-3.7, and Claude-3.5, across 7 in-situ query types. While they perform well at recalling user facts and preferences, they still struggle to provide novel suggestions or to apply users' preferences in new scenarios.
🚨 We also rank these LLMs from top to bottom based on their performance as the number of sessions since the most recent preference mention in the long context increases (top: up to 20 sessions/128k tokens; bottom: up to 60 sessions/1M tokens). GPT-4.5, GPT-4.1, and Gemini-1.5 achieve the highest overall performance; however, their accuracy still hovers around 52% in a multiple-choice setting, highlighting substantial room for improvement. Notably, reasoning models such as o4-mini, o3-mini, o1, and DeepSeek-R1-671B do not demonstrate competitive advantages over non-reasoning models.
We use a Python virtual environment. Run the following commands to create the environment and install all the requirements:
python -m venv myenv
source myenv/bin/activate
pip install -r requirements.txt
Google Gemini models have conflicting dependencies with OpenAI models related to the `google-genai` and `httpx` packages. To run Gemini models, we therefore recommend creating a separate Conda environment:
conda create -n persona_mem python=3.9
conda activate persona_mem
pip install -r requirements.txt
pip install -q -U google-genai
Before you begin, create a new folder named api_tokens/ in the root directory. This folder will store your API keys required to run the models.
- Create API keys from the respective providers if you haven't already.
- Inside the `api_tokens/` folder, create the following text files depending on which models you plan to use, and paste your API key as plain text into the corresponding file:
  - `openai_key.txt` – for OpenAI models
  - `gemini_key.txt` – for Google Gemini models
  - `claude_key.txt` – for Anthropic Claude models
  - `lambda_key.txt` – for models accessed via the Lambda Cloud API (e.g., Llama, DeepSeek, etc.)
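For reference, a script can then read one of these keys with something as simple as the following sketch (the exact loading code in this repository may differ; treat this as a hypothetical example):

```python
from pathlib import Path

# Read the OpenAI key stored as plain text in api_tokens/ (illustrative only;
# the repository's own scripts may load keys differently).
openai_key = Path("api_tokens/openai_key.txt").read_text().strip()
```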
We provide ready-to-use inference scripts in the scripts/ directory for evaluating the following models:
- OpenAI Models
  - GPT-4.5: `inference_gpt_4p5_preview.sh`
  - o3-mini: `inference_o3_mini.sh`
  - o1: `inference_o1.sh`
  - o1-mini: `inference_o1_mini.sh`
  - GPT-4o: `inference_gpt_4o.sh`
  - GPT-4o-mini: `inference_gpt_4o_mini.sh`
- Google Gemini Models
  - Gemini-2.5-Pro: `inference_gemini_2p5_pro.sh`
  - Gemini-2.0-Flash: `inference_gemini_2p0_flash.sh`
  - Gemini-2.0-Flash-Lite: `inference_gemini_2p0_flash_lite.sh`
  - Gemini-1.5-Flash: `inference_gemini_1p5_flash.sh`
- Anthropic Claude Models
  - Claude-3.7-Sonnet: `inference_claude_3p7_sonnet.sh`
  - Claude-3.5-Haiku: `inference_claude_3p5_haiku.sh`
- Meta Llama Models
  - Llama-4-Maverick: `inference_llama4_maverick.sh`
  - Llama-3.1-405B: `inference_llama_3p1_405b.sh`
- DeepSeek Models
  - DeepSeek-R1-671B: `inference_deepseek_r1_671b.sh`
To run evaluation for a specific model, simply execute the corresponding script. For example:
bash scripts/inference_gpt_4o.sh
Each script supports benchmarking at different context window sizes. If the model allows, you can modify the `BENCHMARK_SIZE` variable inside the script to `32k`, `128k`, or `1M`. Currently, only the Gemini models and Llama-4 support context windows up to 1 million tokens.
Evaluation results will be automatically saved to the data/results/ directory.
If you would like to add support for additional models, refer to our implementation in `inference.py` or `inference_standalone_openai.py` for guidance. You only need to update the `__init__` and `query_llm` methods of the `Evaluation` class, as sketched below.
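For example, a new provider could be wired in along the following lines. This is only an illustrative sketch: the actual signatures of `__init__` and `query_llm` in `inference.py` may differ, and the OpenAI-style client here is just a placeholder for whichever SDK your model uses.

```python
from openai import OpenAI  # stand-in for whichever provider SDK you are adding

class Evaluation:
    def __init__(self, model_name: str, api_key: str):
        # Register the client for your new model/provider here.
        self.model_name = model_name
        self.client = OpenAI(api_key=api_key)

    def query_llm(self, messages: list[dict]) -> str:
        # Send the benchmark's message list to the provider and return the text reply.
        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=messages,
        )
        return response.choices[0].message.content
```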
Interested in how we built the conversation data? Keep reading!
We provide a script to automatically generate persona-based multi-session conversations. To run it:
bash scripts/run_all_prepare_data.sh
💡 Tip: If a data generation step fails, it's likely due to syntax issues in the LLM-generated response. Simply regenerate the data of that file.
We also expose the following command-line arguments inside the script:
- `--model` [str]: The LLM used for generation (e.g., `gpt-4o`).
- `--topics` [str]: One or more conversation topics (space-separated for multiple).
- `--n_persona` [int]: Total number of different personas to generate, specified by the `end_persona_id` variable in the script.
- `--s_persona` [int]: The starting index of all personas to generate, specified by the `start_persona_id` variable in the script.
- `--output_dir` [str]: Directory where generated data will be saved.
- `--clean` [store_true]: Remove existing data files and start clean.
- `--verbose` [store_true]: Print all generated content to the console.

You only need to specify integer values for `end_persona_id` and `start_persona_id`; a total of `end_persona_id - start_persona_id` random personas will be created automatically. Data of different topics under the same `persona_id` will always share the same persona.

Example: Generate Conversations for a Single Topic
python prepare_data.py --model gpt-4o --topics therapy --output_dir data/output/ --verbose

Example: Generate Conversations for Multiple Topics
python prepare_data.py --model gpt-4o --topics therapy travelPlanning foodRecommendation --output_dir data/output/ --verbose

We currently include 18 diverse conversation topics: `bookRecommendation`, `coding`, `datingConsultation`, `familyRelations`, `financialConsultation`, `foodRecommendation`, `homeDecoration`, `legalConsultation`, `medicalConsultation`, `movieRecommendation`, `musicRecommendation`, `onlineShopping`, `sportsRecommendation`, `studyConsultation`, `therapy`, `travelPlanning`, `writing`. Feel free to experiment by specifying a new topic name in the command line.
Next, we provide a script to generate question-answering pairs for the conversations. To run it:
bash scripts/run_all_prepare_qa.sh
We also expose the following command-line arguments inside the script:
- `--model` [str]: The LLM used for generation (e.g., `gpt-4o`).
- `--action` [str]: Default `qa` to generate question-answering pairs.
- `--topics` [str]: One or more conversation topics (space-separated for multiple).
- `--n_persona` [int]: Total number of different personas to generate, specified by `end_persona_id` in the script.
- `--s_persona` [int]: The starting index of all personas to generate, specified by `start_persona_id` in the script.
- `--time` [str]: A list of time periods selected from `init`, `next_week`, `next_month`, and `next_year`, specified by the `time_periods` variable in the script.
- `--clean` [store_true]: Remove existing data files and start clean.
- `--verbose` [store_true]: Print all generated content to the console.

Example: Generate Question-Answering Pairs for Multiple Topics
python prepare_data.py --model gpt-4o --action qa --topics therapy travelPlanning foodRecommendation --time init --verbose
🧩 Now we have conversations and Q&A pairs for each conversation session. Let’s concatenate them to form the full interaction history.
We provide a script to concatenate the conversation sessions and construct the full benchmark contexts. To run it, for example:
bash scripts/run_generate_benchmark.sh large
The context length is determined by the argument you pass to the script:
- `small` → up to 32k tokens
- `medium` → up to 128k tokens
- `large` → up to 1M tokens
We also expose the following command-line arguments inside the script:
- `--model` [str]: The LLM used for filtering low-quality questions (e.g., `gpt-4o-mini`).
- `--step` [str]: Default `prepare` to generate benchmark contexts.
- `--idx_persona` [int]: The index of the persona for which the context is constructed, specified by `start_persona_id` and `end_persona_id` in the script.
- `--n_blocks` [int]: Total number of conversation sessions to concatenate. This is set automatically when using `small`, `medium`, or `large`.
- `--n_variants` [int]: Number of different topological variants (orderings) of conversation sessions to concatenate.
- `--filter_questions` [store_true]: Use an LLM to remove questions that can be answered directly without seeing the context.
- `--clean` [store_true]: Remove existing data files and start clean.
- `--verbose` [store_true]: Print all generated content to the console.

Example: Generate Full Context for One Persona
python inference.py --step prepare --model gpt-4o-mini --idx_persona 0 --n_blocks 60 --n_variants 2 --filter_questions --clean --verbose