Training a Vision Model with Text-only Inputs #1590

Open
jjhoow opened this issue Jan 28, 2025 · 8 comments
jjhoow commented Jan 28, 2025

I need to train the vision model using only text inputs. I tried using Colab notebooks but noticed that images are mandatory in the data. After researching a bit more, I found a Colab notebook that trains with images, and I modified it. The code now looks like this:

# Create a data collator to encode text and image pairs
def collate_fn(examples):
    # Extract the messages in the correct format
    processed_examples = [example['messages'] for example in examples]
    
    # Apply the chat template to each example
    texts = [tokenizer.apply_chat_template(messages, tokenize=False) 
            for messages in processed_examples]

    # Tokenize the texts
    batch = tokenizer(
        text=texts, 
        images=None, 
        return_tensors="pt", 
        padding=True
    )

    # Create labels from the input_ids
    labels = batch["input_ids"].clone()
    labels[labels == tokenizer.tokenizer.pad_token_id] = -100

    batch["labels"] = labels

    return batch
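
The label-masking step above follows the standard convention: positions holding the pad token are set to -100 so the cross-entropy loss ignores them. A minimal sketch of just that step, using plain Python and an illustrative pad id (no real tokenizer involved):

```python
# Illustrative sketch of the label-masking convention used in collate_fn above.
# PAD_TOKEN_ID is a made-up value for demonstration; a real run would use
# tokenizer.tokenizer.pad_token_id.
PAD_TOKEN_ID = 0

def mask_pad_labels(input_ids, pad_token_id=PAD_TOKEN_ID):
    """Clone input_ids into labels, replacing pad positions with -100."""
    return [tok if tok != pad_token_id else -100 for tok in input_ids]

# A padded sequence: real token ids followed by padding.
padded = [101, 42, 7, 0, 0]
labels = mask_pad_labels(padded)
print(labels)  # [101, 42, 7, -100, -100]
```
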

from unsloth import is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator  # unused here; replaced by collate_fn above
from trl import SFTTrainer, SFTConfig

FastVisionModel.for_training(model) # Enable for training!

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    data_collator = collate_fn, # Custom text-only collator (replaces UnslothVisionDataCollator)
    train_dataset = dataset,
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 120,
        num_train_epochs = 1, # Set this instead of max_steps for full training runs
        learning_rate = 2e-4,
        fp16 = not is_bf16_supported(),
        bf16 = is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",     # Set to "wandb" to log to Weights and Biases

        # You MUST put the below items for vision finetuning:
        remove_unused_columns = False,
        dataset_text_field = "",
        dataset_kwargs = {"skip_prepare_dataset": True},
        dataset_num_proc = 4,
        max_seq_length = 2048,
    ),
)

I am not entirely sure if this is 100% correct, but it seems to work (it's training).

Is there a proper way to train using only text datasets?

Additionally, can someone suggest a way to train on tools using the original chat template from LLaMA? I couldn’t understand how to structure the dataset so it works with the tokenizer.
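
For the tool-training question, one common layout (an assumption on my part, following the general transformers chat-template convention rather than anything confirmed in this thread) puts the model's call in an assistant turn under `tool_calls` and the tool's result in a `"tool"` role turn. The function name `get_weather` and its arguments below are purely illustrative:

```python
# Hedged sketch of a tool-use conversation record; field names follow the
# common chat-template convention, values are made up for illustration.
messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
    {
        "role": "assistant",
        "tool_calls": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",          # hypothetical tool
                    "arguments": {"city": "Paris"},
                },
            }
        ],
    },
    # The tool's output is fed back in as a "tool" role turn.
    {"role": "tool", "name": "get_weather", "content": '{"temp_c": 18}'},
    {"role": "assistant", "content": "It is about 18 °C in Paris."},
]

print(messages[1]["tool_calls"][0]["function"]["name"])  # get_weather
```

Whether this renders correctly depends on the specific model's chat template, so it is worth decoding one formatted example and inspecting it before training.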

@danielhanchen
Contributor

Apologies for the issue - I'm still working out the best way for people to do text-only training (or a mixture of text and images) - for now, your code looks correct!

@yukiarimo

Any updates? Does text-only already work?

@jjhoow
Author

jjhoow commented Feb 6, 2025

Any updates? Does text-only already work?

This approach above works in Qwen based on my tests.

@yukiarimo

Great to hear that! Will try this later!

Will something like this code (raw text in JSONL) work? #1505 (comment)

@jjhoow
Author

jjhoow commented Feb 6, 2025

If the text is already in the structure the model expects, with the system, user, and assistant tags, I believe it will work. I trained using a ShareGPT-format dataset and Qwen2-VL, because Qwen2.5-VL is not supported in vLLM.

Edit:
There's one detail: this is my model: JJhooww/Fluxi_AI_Small_Vision.

If you want to check the datasets used, I had to modify the structure a bit to support the "type": "text", "text": {content} format.

{"conversations": [
  { "role": "system", "content": [ { "type": "text", "text": system_content } ] },
  { "role": "user", "content": [ { "type": "text", "text": user_content } ] },
  { "role": "assistant", "content": [ { "type": "text", "text": assistant_content } ] }
]}
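
To get plain role/content strings into that nested form, a small conversion helper (the function name is mine, not from the model card) might look like:

```python
def to_nested_content(turns):
    """Wrap each plain string content in the [{"type": "text", "text": ...}] form."""
    return [
        {"role": t["role"], "content": [{"type": "text", "text": t["content"]}]}
        for t in turns
    ]

# Plain ShareGPT-style turns with string contents.
plain = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi!"},
]
nested = to_nested_content(plain)
print(nested[1]["content"][0]["text"])  # Hi!
```
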
def collate_fn(examples):
    # Extract the messages in the correct format
    processed_examples = [example['conversations'] for example in examples]
    
    # Apply the chat template to each example
    texts = [processor.apply_chat_template(messages, tokenize=False) 
            for messages in processed_examples]

    # Tokenize the texts
    batch = processor(
        text=texts, 
        return_tensors="pt", 
        padding=True
    )

    # Create labels from the input_ids
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100

    batch["labels"] = labels

    return batch

@yukiarimo

Yay! Time to do some magic! By the way, is Qwen2.5-VL 7B supported for RLHF and full training too, or only LoRA/QLoRA?

Also, I would like to do RLHF with a question plus right/wrong answers. That's called DPO, right? But why are there so many other methods? What's the difference?

@jjhoow
Author

jjhoow commented Feb 6, 2025

Unsloth is also great because it's hackable, so I could test the changes above however I wanted. As for the multiple RLHF methods, Hugging Face has a brief explanation of each method and their differences.
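
For DPO in particular, the usual preference dataset is a set of records pairing one prompt with a preferred and a dispreferred completion (the `prompt`/`chosen`/`rejected` field names follow the common TRL convention; the values below are illustrative only):

```python
# One DPO preference record: same prompt, a preferred ("chosen") and a
# dispreferred ("rejected") completion. Values are made up for illustration.
record = {
    "prompt": "Explain overfitting in one sentence.",
    "chosen": "Overfitting is when a model memorizes its training data and fails to generalize.",
    "rejected": "Overfitting is when the model is too small.",
}

print(sorted(record))  # ['chosen', 'prompt', 'rejected']
```

Other methods (PPO, KTO, ORPO, ...) mainly differ in whether they need a reward model, paired preferences, or just per-example labels.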

@yukiarimo

Hey! Just tried again. New error: #1651

Do you know how to fix this?
