Training a Vision Model with Text-only Inputs #1590
Comments
Apologies for the issue - I'm still trying to work out the best way for people to do text only (or a mixture of text + images) - for now your code looks correct! |
Any updates? Does text-only already work? |
This approach above works in Qwen based on my tests. |
Great to hear that! Will try this later! Will something like this code (raw text in JSONL) work? #1505 (comment) |
If the text is already in the expected structure for the model, with the system, user, and assistant tags, I believe it will work. I trained using ShareGPT-style data and Qwen2VL, because Qwen2.5-VL is not supported in vLLM. Edit: if you want to check the datasets used, I had to modify the structure a bit to the following:

{"conversations": [
    { "role": "system", "content": [ { "type": "text", "text": system_content } ] },
    { "role": "user", "content": [ { "type": "text", "text": user_content } ] },
    { "role": "assistant", "content": [ { "type": "text", "text": assistant_content } ] }
]}

def collate_fn(examples):
    # Extract the messages in the correct format
    processed_examples = [example['conversations'] for example in examples]
    # Apply the chat template to each example
    texts = [processor.apply_chat_template(messages, tokenize=False)
             for messages in processed_examples]
    # Tokenize the texts
    batch = processor(
        text=texts,
        return_tensors="pt",
        padding=True,
    )
    # Build labels from input_ids, masking padding positions
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    batch["labels"] = labels
    return batch |
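To make the label masking in the collate function concrete: positions set to -100 are skipped by PyTorch's cross-entropy loss (its default `ignore_index`), so padding tokens never contribute to the training loss. A minimal sketch in plain Python of the same masking logic; the pad id of 0 here is a hypothetical stand-in for `processor.tokenizer.pad_token_id`:

```python
def mask_pad_labels(input_ids, pad_token_id, ignore_index=-100):
    """Copy input_ids into labels, replacing padding tokens with
    ignore_index so the loss skips those positions."""
    return [[ignore_index if tok == pad_token_id else tok for tok in row]
            for row in input_ids]

# Two padded sequences from a batch (0 = hypothetical pad token id)
batch_ids = [[5, 6, 7, 0, 0],
             [5, 9, 0, 0, 0]]
labels = mask_pad_labels(batch_ids, pad_token_id=0)
# labels: [[5, 6, 7, -100, -100], [5, 9, -100, -100, -100]]
```

The tensor version in the collate function does the same thing in one vectorized step with `labels[labels == pad_token_id] = -100`.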
Yay! Time to do some magic! By the way, is Qwen2.5-VL 7B supported for RLHF and full training too, or only LoRA/QLoRA? Also, I would like to do RLHF with a question plus right/wrong answers. That is called DPO, right? But why are there so many other methods? What are the differences? |
Unsloth is also great because it's hackable! So I tested making the changes above the way I wanted. As for the multiple RLHF methods, Hugging Face has a brief explanation of each method and their differences. |
Hey! Just tried again. New error: #1651 Do you know how to fix this? |
I need to train the vision model using only text inputs. I tried using Colab notebooks but noticed that images are mandatory in the data. After researching a bit more, I found a Colab notebook that trains with images, and I modified it. The code now looks like this:
I am not entirely sure if this is 100% correct, but it seems to work (it's training).
Is there a proper way to train using only text datasets?
Additionally, can someone suggest a way to train on tools using the original chat template from LLaMA? I couldn’t understand how to structure the dataset so it works with the tokenizer.
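For the text-only case described above, one practical step is converting plain prompt/response pairs into the conversation structure discussed in the thread, with no image entries in `content`. A hedged sketch; the helper name is made up for illustration, and the field names follow the structure posted above:

```python
def to_text_only_conversation(system, user, assistant):
    """Build one training record in the text-only conversation format:
    each message carries a single {"type": "text"} content entry and
    no image entries."""
    def msg(role, text):
        return {"role": role, "content": [{"type": "text", "text": text}]}
    return {"conversations": [msg("system", system),
                              msg("user", user),
                              msg("assistant", assistant)]}

record = to_text_only_conversation(
    "You are a helpful assistant.",
    "What is the capital of France?",
    "The capital of France is Paris.",
)
```

Records in this shape can then be fed to a collate function that applies the processor's chat template, since every `content` list contains only text entries.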